Abstract: We present the concept and implementation of a
self-stabilizing Byzantine fault-tolerant distributed clock
generation scheme for multi-synchronous GALS architectures in
critical applications. It combines a variant of a recently
introduced self-stabilizing algorithm for generating low-frequency,
low-accuracy synchronized pulses with a simple non-stabilizing
high-frequency, high-accuracy clock synchronization algorithm.
We provide thorough correctness proofs and a performance analysis,
which use methods from fault-tolerant distributed computing research
but also address hardware-related issues like metastability. The
algorithm, which consists of several concurrent communicating
asynchronous state machines, has been implemented in VHDL using
Petrify in conjunction with some extensions, and synthesized for an
Altera Cyclone FPGA. An experimental validation of this prototype
has been carried out to confirm the skew and clock frequency bounds
predicted by the theoretical analysis, as well as the very short
stabilization times (required for recovering after excessively many
transient failures) achievable in practice.

I. Introduction 

To circumvent the cumbersome clock tree engineering issue [1], [2],
[3], [4], systems-on-chip (SoC) are nowadays increasingly designed
globally asynchronous locally synchronous (GALS) [5]. Using
independent and hence unsynchronized clock domains requires
asynchronous cross-domain communication mechanisms or synchronizers
[6], [7], [8], however, which inevitably create the potential for
metastability [9]. This problem can be circumvented by means of
multi-synchronous clocking [10], [11], which guarantees a certain
degree of synchrony between clock domains. Multi-synchronous GALS is
particularly beneficial from a designer's point of view, since it
combines the convenient local synchrony of a GALS system with a
global time base across the whole chip, including the ability for
metastability-free high-speed communication across clock domains
[12].

The decreasing feature sizes of deep submicron VLSI technology have
also resulted in an increased likelihood of chip components failing
during operation: Reduced voltage swings and smaller critical
charges make circuits more susceptible to ionized particle hits,
crosstalk, and electromagnetic interference [13], [14], [15], [16],
[17], [18]. Fault-tolerance hence becomes an increasingly pressing
issue also for chip design. Unfortunately, faulty components may
behave non-benignly in many ways: They may perform signal
transitions at arbitrary times and even convey inconsistent
information to their successor components if their outputs are
affected by a failure. Well-known theory on fault-tolerant agreement
and synchronization shows that this behaviour is the key feature of
unrestricted, i.e., Byzantine faults [19]. This forces us to model
faulty components as Byzantine if a high fault coverage is to be
guaranteed.

Unfortunately, lower-bound results [19], [20] reveal that, in order
to cope with some maximum number $f$ of Byzantine faulty components
(say, processors) throughout an execution of a system,
$n \ge 3f + 1$ components are required. Given the typically
transient nature of failures in digital circuits, these bounds
reveal that even a Byzantine fault-tolerant system cannot be
expected to recover from a situation where more than $f$ components
became faulty transiently, since their state may be corrupted.
Dealing with this problem is in the realm of self-stabilizing
algorithms [21], which are guaranteed to recover even if each and
every component of the system fails arbitrarily, but later on works
according to its specification again: in that case the system
resumes correct operation after some stabilization time following
the instant when no more failures occur. Byzantine-tolerant
self-stabilizing algorithms [22], [23], [24], [25], [26], [27], [28]
combine the best of both worlds, by guaranteeing both correct
operation and self-stabilization in the presence of up to $f$
Byzantine faulty components in the system.

This paper presents the concept and a prototype implementation of a
novel approach, termed FATAL+, for multi-synchronous clocking in
GALS systems. It relies on a self-stabilizing and Byzantine
fault-tolerant distributed algorithm, consisting of $n$ identical
instances (called nodes), which generate $n$ local clock signals
(one for each clock domain) with the following properties: Bounded
skew, i.e., bounded maximum time between the $k$-th clock
transitions of any two clock signals of correct nodes, and bounded
accuracy (i.e., frequency), i.e., bounded minimum and maximum time
between the occurrence of any two successive clock transitions of
the clock signal at any correct node. At most $f < n/3$ nodes may
behave Byzantine faulty, in which case their clock signals may be
arbitrary. The whole algorithm can be directly implemented in
hardware, without quartz oscillators, using standard asynchronous
logic gates only.

FATAL+ self-stabilizes within $O(kn)$ time with probability
$1 - 2^{-k(n-f)}$ (with constant expectation in typical settings),
and is metastability-free by construction after stabilization in
failure-free runs.^1 If the number of faults is not overwhelming,
i.e., a majority of at least $n - f$ nodes continues to execute the
protocol in an orderly fashion, recovering nodes and late joiners
(re)synchronize deterministically in constant time.

Detailed contributions: (1) In Sections II-VI we present the concept
and theoretical analysis of FATAL+, which is based on a variant of
the randomized self-stabilizing Byzantine-tolerant pulse
synchronization algorithm [28] we recently proposed. It eventually
generates synchronized periodic pulses with moderate skew and low
frequency, and improves upon the results from [28] in that it
tolerates arbitrarily large clock drifts and allows late joiners or
nodes recovering from transient faults to deterministically
resynchronize within constant time. The formal proof of these
properties builds upon and extends the analysis in [30]. In
Section VI this algorithm is integrated with a Byzantine-tolerant
but non-self-stabilizing tick generation algorithm based on Srikanth
and Toueg's clock synchronization algorithm [31], operating in a
control loop: The latter, referred to as the quick cycle algorithm,
generates clock ticks with high frequency and small skew, which also
(weakly) affect pulse generation. On the other hand, quick cycle
uses pulses to monitor its ticks in order to detect the need for
stabilization.

(2) In Section VII we present the major ingredients of an Altera
Cyclone IV FPGA prototype implementation of FATAL+. It primarily
consists of multiple hybrid (asynchronous + synchronous) state
machines, which have been generated semi-automatically from the
specification of the algorithms using Petrify [32]. Non-standard
extensions were needed for ensuring deadlock-free communication
despite arbitrarily many desynchronized nodes, some of which could
be Byzantine faulty, which, e.g., forced us to use state-based
communication instead of handshake-based communication. Special care
also had to be exercised for ensuring self-stabilizing elementary
building blocks and metastability-freedom in normal operation (after
stabilization).

(3) In Section VIII we provide some results of the experimental
evaluation of our prototype implementation. They demonstrate the
feasibility of FATAL+ and confirm the results of our theoretical
analysis, in particular, a tight skew bound, in the presence of
Byzantine faulty nodes. Special emphasis has been put on experiments
validating the predictions related to stabilization time, which
revealed that the system indeed stabilizes in very short time from
any initial/error state.

^1 It is easy to see that metastable upsets cannot be ruled out in
executions involving Byzantine faults. However, they can be made as
unlikely as desired by using synchronizers or elastic pipelines
acting as metastability filters [29].

Section IX concludes the paper.

Related work: The work [33], [34], [35], [36] on distributed clock
generation in VLSI circuits is essentially based on (distributed)
ring oscillators, formed by regular structures (rings, meshes) of
multiple inverter loops. Since clock synchronization theory [20]
reveals that high connectivity is required for bounded
synchronization tightness in the presence of failures, these
approaches are fundamentally restricted in that they can overcome at
most a small constant number of Byzantine failures.

The only exception we are aware of is the DARTS fault-tolerant clock
generation approach [37], [38], which also addresses
multi-synchronous clocking in GALS systems. Like FATAL+, DARTS is
based on a fault-tolerant distributed algorithm [39] implemented in
asynchronous digital logic. Although it shares many features with
FATAL+, including Byzantine fault-tolerance, it is not
self-stabilizing: If more than $f$ nodes ever become faulty, the
system will not recover even if all nodes work correctly thereafter.
Moreover, in DARTS, simple transient faults such as radiation- or
crosstalk-induced additional (or omitted) clock ticks accumulate
over time to arbitrarily large skews in an otherwise benign
execution. Despite not suffering from these drawbacks, FATAL+ offers
similar guarantees in terms of area consumption, clock skew, and
amortized frequency as DARTS.

Furthermore, a number of Byzantine-tolerant self-stabilizing clock
synchronization protocols [22], [23], [24], [25], [26], [27] have
been devised by the distributed systems community. Beyond optimal
resilience, an attractive feature of most of these protocols is a
small stabilization time. However, all of them exhibit deficiencies
rendering them unsuitable in the VLSI context. This motivated us to
devise the algorithm from [28], [30], an improved variant of which
forms the basis of FATAL+.

II. Model 

In this section we introduce our system model. Our formal framework
will be tied to the peculiarities of hardware designs, which consist
of modules that continuously compute their output signals based on
their input signals.^2

^2 In sharp contrast to classic distributed computing models, there
is no computationally complex discrete zero-time state-transition
here.



Signals 

Following [40], [41], we define (the trace of) a signal to be a
timed event trace over a finite alphabet $\mathbb{S}$ of possible
signal states: Formally, signal
$\sigma \subseteq \mathbb{S} \times \mathbb{R}_0^+$. All times and
time intervals refer to a global reference time taken from
$\mathbb{R}_0^+$, that is, signals reflect the system's state from
time $0$ on. The elements of $\sigma$ are called events, and for
each event $(s,t)$ we call $s$ the state of event $(s,t)$ and $t$
the time of event $(s,t)$. In general, a signal $\sigma$ is required
to fulfill the following conditions: (i) for each time interval
$[t^-,t^+] \subseteq \mathbb{R}_0^+$ of finite length, the number of
events in $\sigma$ with times within $[t^-,t^+]$ is finite,
(ii) from $(s,t) \in \sigma$ and $(s',t) \in \sigma$ follows that
$s = s'$, and (iii) there exists an event at time $0$ in $\sigma$.

Note that our definition allows for events $(s,t)$ and
$(s,t') \in \sigma$, where $t < t'$, without having an event
$(s',t'') \in \sigma$ with $s' \ne s$ and $t < t'' < t'$. In this
case, we call event $(s,t')$ idempotent. Two signals $\sigma$ and
$\sigma'$ are equivalent iff they differ in idempotent events only.
We identify all signals of an equivalence class, as they describe
the same physical signal. Each equivalence class $[\sigma]$ of
signals contains a unique signal $\sigma_0$ having no idempotent
events. We say that signal $\sigma$ switches to $s$ at time $t$ iff
event $(s,t) \in \sigma_0$.

The state of signal $\sigma$ at time $t \in \mathbb{R}_0^+$, denoted
by $\sigma(t)$,^3 is given by the state of the event with the
maximum time not greater than $t$. Because of (i), (ii) and (iii),
$\sigma(t)$ is well defined for each time $t \in \mathbb{R}_0^+$.
Note that $\sigma$'s state function in fact depends on $[\sigma]$
only, i.e., we may add or remove idempotent events at will without
changing the state function.
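The signal definitions above can be illustrated with a small Python sketch. It is only an illustration under our own assumptions: a signal is represented as a finite list of (state, time) pairs, and the helper names `canonical` and `state_at` are hypothetical, not part of the paper.

```python
from bisect import bisect_right


def canonical(events):
    """Return the representative without idempotent events.

    An event is kept only if its state differs from the state of the
    previously kept event (assumed representation: list of (state, time)).
    """
    result = []
    for state, time in sorted(events, key=lambda event: event[1]):
        if not result or result[-1][0] != state:
            result.append((state, time))
    return result


def state_at(events, t):
    """State of the event with the maximum time not greater than t."""
    ordered = sorted(events, key=lambda event: event[1])
    times = [time for _, time in ordered]
    index = bisect_right(times, t) - 1
    if index < 0:
        # Condition (iii) demands an event at time 0, so this only
        # happens for malformed input or negative t.
        raise ValueError("no event at or before time t")
    return ordered[index][0]


# Hypothetical example trace; ("a", 1.0) and ("b", 3.0) are idempotent.
sigma = [("a", 0.0), ("a", 1.0), ("b", 2.5), ("b", 3.0), ("a", 4.0)]
```

In this sketch, `canonical(sigma)` plays the role of the unique signal without idempotent events, and `state_at` gives the same answer for `sigma` and `canonical(sigma)` at every time, matching the remark that the state function depends on the equivalence class only.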

Distributed System 

On the topmost level of abstraction, we see the system
as a set $V = \{1, \ldots, n\}$ of physically remote nodes that
communicate by means of channels. In the context of a VLSI 
circuit, "physically remote" actually refers to quite small 
distances (centimeters or even less). However, at gigahertz 
frequencies, a local state transition will not be observed 
remotely within a time that is negligible compared to clock 
speeds. We stress this point, since it is crucial that different 
clocks (and their attached logic) are not placed too close 
to each other, as otherwise they might fail due to the same 
event such as a particle hit. This would render it pointless 
to devise a system that is resilient to a certain fraction of 
the nodes failing. 

Each node $i$ comprises a number of input ports, namely $S_{i,j}$
for each node $j$, an output port $S_i$, and a set of local ports,
introduced later on. An execution of the distributed system assigns
to each port of each node a signal. For convenience of notation, for
any port $p$, we refer to the signal assigned to port $p$ simply by
signal $p$. We say that node $i$ is in state $s$ at time $t$ iff
$S_i(t) = s$. We further say that node $i$ switches to state $s$ at
time $t$ iff signal $S_i$ switches to $s$ at time $t$.

^3 To facilitate intuition, we here slightly abuse notation, as this
way $\sigma$ denotes both a function of time and the signal (trace),
which is a subset of $\mathbb{S} \times \mathbb{R}_0^+$. Whenever
referring to $\sigma$, we will talk of the signal, not the state
function.

Nodes exchange their states via the channels between them: for each
pair of nodes $i$, $j$, output port $S_i$ is connected to input port
$S_{j,i}$ by a FIFO channel from $i$ to $j$. Note that this includes
a channel from $i$ to $i$ itself. Intuitively, $S_i$ being connected
to $S_{j,i}$ by a (non-faulty) channel means that $S_{j,i}(\cdot)$
should mimic $S_i(\cdot)$, however, with a slight delay accounting
for the time it takes the channel to propagate events. In contrast
to an asynchronous system, this delay is bounded by the maximum
delay $d > 0$.^4

Formally we define: The channel from node $i$ to $j$ is said to be
correct during $[t^-,t^+]$ iff there exists a function
$\tau_{i,j} : \mathbb{R}_0^+ \to \mathbb{R}_0^+$, called the
channel's delay function, such that: (i) $\tau_{i,j}$ is continuous
and strictly increasing, (ii)
$\forall t \in [\max(t^-, \tau_{i,j}(0)), t^+] :
0 \le t - \tau_{i,j}^{-1}(t) < d$, and (iii) for each
$t \in [\max(t^-, \tau_{i,j}(0)), t^+]$,
$(s,t) \in S_{j,i} \Leftrightarrow (s, \tau_{i,j}^{-1}(t)) \in S_i$,
and for each $t \in [t^-, \tau_{i,j}(0))$,
$(s,t) \in S_{j,i} \Rightarrow s = S_i(0)$. Note that because of
(i), $\tau_{i,j}^{-1}$ exists in the domain
$[\tau_{i,j}(0), \infty)$, and thus (ii) and (iii) are well defined.
We say that node $i$ observes node $j$ in state $s$ at time $t$ if
$S_{i,j}(t) = s$.
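As a rough illustration of a correct channel, the following Python sketch maps the events of an output port to the events seen on the connected input port. The function names, the list-of-pairs representation, and the choice of starting the interval at time 0 are our assumptions for the example, not part of the model.

```python
def channel_output(sender_events, tau):
    """Events on the receiving input port for a correct channel.

    sender_events: list of (state, time) pairs of the output port,
                   including an event at time 0 (assumed representation).
    tau:           delay function mapping a sending time to the time at
                   which the event shows up at the receiver.
    """
    ordered = sorted(sender_events, key=lambda event: event[1])
    initial_state = ordered[0][0]  # state of the sender at time 0
    received = [(state, tau(time)) for state, time in ordered]
    if received[0][1] > 0:
        # Before tau(0), the receiver may only see the sender's initial state.
        received.insert(0, (initial_state, 0.0))
    return received


def delay_within_bound(tau, sending_times, d):
    """Check 0 <= tau(u) - u < d for the given sending times u."""
    return all(0 <= tau(u) - u < d for u in sending_times)


# Hypothetical channel with a constant delay of 0.5 time units.
constant_delay = lambda u: u + 0.5
```

Writing condition (ii) in terms of the sending time $u = \tau_{i,j}^{-1}(t)$ gives the equivalent check $0 \le \tau_{i,j}(u) - u < d$ used in `delay_within_bound`. A constant delay is only the simplest admissible choice; any continuous, strictly increasing delay function within the bound would do.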

Clocks and Timeouts 

Nodes are never aware of the current reference time, and we also do
not require the reference time to resemble Newtonian "real" time.
Rather, we allow for physical clocks that run arbitrarily fast or
slow,^5 as long as their speeds are close to each other in
comparison. One may hence think of the reference time as progressing
at the speed of the currently slowest correct clock. In this
framework, nodes essentially make use of bounded clocks with bounded
drift.

Formally, clock rates are within $[1, \vartheta]$ (with respect to
reference time), where $\vartheta > 1$ is constant and
$\vartheta - 1$ is the (maximum) clock drift. A clock $C$ is a
continuous, strictly increasing function
$C : \mathbb{R}_0^+ \to \mathbb{R}_0^+$ mapping reference time to
some local time. Clock $C$ is said to be correct during
$[t^-,t^+] \subseteq \mathbb{R}_0^+$ iff we have for any
$t, t' \in [t^-,t^+]$, $t < t'$, that
$t' - t \le C(t') - C(t) \le \vartheta (t' - t)$. Each node
comprises a set of clocks assigned to it, which allow the node to
estimate the progress of reference time.
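The drift condition can be checked numerically on sampled points, as in the following Python sketch. The function name `clock_is_correct`, the sampling approach, and the example clocks are our assumptions; passing the check on samples is only a necessary condition for correctness of the continuous clock.

```python
def clock_is_correct(clock, sample_times, theta, eps=1e-12):
    """Check t' - t <= C(t') - C(t) <= theta * (t' - t) on samples.

    Checking consecutive samples suffices for all sampled pairs,
    because both inequalities add up over adjacent intervals.
    eps absorbs floating-point rounding.
    """
    points = sorted(sample_times)
    for t, t_next in zip(points, points[1:]):
        elapsed_reference = t_next - t
        elapsed_local = clock(t_next) - clock(t)
        if elapsed_local < elapsed_reference - eps:
            return False  # clock runs slower than reference time
        if elapsed_local > theta * elapsed_reference + eps:
            return False  # clock exceeds the maximum rate theta
    return True


# Hypothetical clocks: rate 1.05 with an offset, and a too-slow clock.
good_clock = lambda t: 1.05 * t + 3.0
slow_clock = lambda t: 0.9 * t
```

With $\vartheta = 1.1$, `good_clock` passes and `slow_clock` fails. Note that the offset of 3.0 is irrelevant, since the definition constrains only differences of clock values.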

^4 With respect to $O$-notation, we normalize $d \in O(1)$, as all
time bounds simply depend linearly on $d$.

^5 Note that the formal definition excludes trivial solutions by
requiring clocks' progress to be in a linear envelope of the
reference time, see below.

Instead of directly accessing the value of their clocks, nodes have
access to so-called timeout ports of watchdog timers. A timeout is a
triple $(T, s, C)$, where $T \in \mathbb{R}^+$ is a duration,
$s \in \mathbb{S}$ is a state, and $C$ is some local clock (there
may be several), say of node $i$. Each timeout $(T, s, C)$ has a
corresponding timeout port $\mathrm{Time}_{T,s,C}$, being part of
node $i$'s local ports. Signal $\mathrm{Time}_{T,s,C}$ is Boolean,
that is, its possible states are from the set $\{0, 1\}$. We say
that timeout $(T, s, C)$ is correct during
$[t^-,t^+] \subseteq \mathbb{R}_0^+$ iff clock $C$ is correct during
$[t^-,t^+]$ and the following holds:

1) For each time $t_s \in [t^-,t^+]$ when node $i$ switches to
state $s$, there is a time $t \in [t_s, \tau_{i,i}(t_s)]$ such that
$(T, s, C)$ is reset, i.e., $(0,t) \in \mathrm{Time}_{T,s,C}$. This
is a one-to-one correspondence, i.e., $(T, s, C)$ is not reset at
any other times.

2) For a time $t \in [t^-,t^+]$, denote by $t_0$ the supremum of
all times from $[t^-,t]$ when $(T, s, C)$ is reset. Then it holds
that $(1,t) \in \mathrm{Time}_{T,s,C}$ iff $C(t) - C(t_0) = T$.
Again, this is a one-to-one correspondence.

We say that timeout $(T, s, C)$ expires at time $t$ iff
$\mathrm{Time}_{T,s,C}$ switches to $1$ at time $t$, and it is
expired at time $t$ iff $\mathrm{Time}_{T,s,C}(t) = 1$. For
notational convenience, we will omit the clock $C$ and simply write
$(T, s)$ for both the timeout and its signal.
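To make the timeout semantics concrete, the following Python sketch computes the reference time at which a timeout reset at time $t_0$ expires, i.e., the time $t$ with $C(t) - C(t_0) = T$. The function name `expiry_time`, the use of bisection, and the example clock are our assumptions. Since clock rates lie in $[1, \vartheta]$, the expiry time must fall within $[t_0 + T/\vartheta, t_0 + T]$, which the sketch uses as its search interval.

```python
def expiry_time(clock, t0, duration, theta, tol=1e-9):
    """Reference time t at which clock(t) - clock(t0) == duration.

    Assumes the clock is continuous, strictly increasing, and has
    rates within [1, theta], so the solution lies in
    [t0 + duration / theta, t0 + duration].
    """
    target = clock(t0) + duration
    low = t0 + duration / theta   # earliest possible expiry (fastest clock)
    high = t0 + duration          # latest possible expiry (slowest clock)
    while high - low > tol:
        middle = (low + high) / 2
        if clock(middle) < target:
            low = middle
        else:
            high = middle
    return high


# Hypothetical clock running 5% faster than reference time.
example_clock = lambda t: 1.05 * t
```

For instance, with `example_clock`, a timeout of duration 2.1 (local time) reset at reference time 2.0 expires at reference time 4.0, i.e., after 2.0 units of reference time rather than 2.1.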

A randomized timeout is a triple $(\mathcal{D}, s, C)$, where
$\mathcal{D}$ is a bounded random distribution on $\mathbb{R}_0^+$,
$s \in \mathbb{S}$ is a state, and $C$ is a clock. Its corresponding
timeout port $\mathrm{Time}_{\mathcal{D},s,C}$ behaves very
similarly to that of an ordinary timeout, except that whenever it is
reset, the local time that passes until it expires next (provided
that it is not reset again before that happens) follows the
distribution $\mathcal{D}$. Formally, $(\mathcal{D}, s, C)$ is
correct during $[t^-,t^+] \subseteq \mathbb{R}_0^+$ if $C$ is
correct during $[t^-,t^+]$ and the following holds:

1) For each time $t_s \in [t^-,t^+]$ when node $i$ switches to
state $s$, there is a time $t \in [t_s, \tau_{i,i}(t_s)]$ such that
$(\mathcal{D}, s, C)$ is reset, i.e.,
$(0,t) \in \mathrm{Time}_{\mathcal{D},s,C}$. This is a one-to-one
correspondence, i.e., $(\mathcal{D}, s, C)$ is not reset at any
other times.

2) For a time $t \in [t^-,t^+]$, denote by $t_0$ the supremum of
all times from $[t^-,t]$ when $(\mathcal{D}, s, C)$ is reset. Let
$\mu : \mathbb{R}_0^+ \to \mathbb{R}_0^+$ denote the density of
$\mathcal{D}$. Then $(1,t) \in \mathrm{Time}_{\mathcal{D},s,C}$
"with probability $\mu(C(t) - C(t_0))$" and we require that the
probability of $(1,t) \in \mathrm{Time}_{\mathcal{D},s,C}$,
conditional to $t_0$ and $C$ on $[t_0,t]$ being given, is
independent of the system's state at times smaller than $t$. More
precisely, if superscript $\mathcal{E}$ identifies variables in
execution $\mathcal{E}$ and $t_0'$ is the infimum of all times from
$(t_0, t^+]$ when node $i$ switches to state $s$, then we demand for
any $[\tau^-, \tau^+] \subseteq [t_0, t_0']$ that

$$P\big[\exists t' \in [\tau^-, \tau^+] :
(1,t') \in \mathrm{Time}_{\mathcal{D},s,C}
\,\big|\, t_0,\ C|_{[t_0,\tau^+]}\big]
= \int_{\tau^-}^{\tau^+} \mu(C(\tau) - C(t_0)) \, d\tau,$$

independently of $\mathcal{E}|_{[0,\tau^-)}$.

We will apply the same notational conventions to random- 
ized timeouts as we do for regular timeouts. 

Note that, strictly speaking, this definition does not induce a
random variable describing the time $t' \in [t_0, t_0')$ satisfying
that $(1,t') \in \mathrm{Time}_{\mathcal{D},s,C}$. However, for the
state of the timeout port, we get the meaningful statement that for
any $t' \in [t_0, t_0')$,

$$P\big[\mathrm{Time}_{\mathcal{D},s,C} \text{ switches to } 1
\text{ during } [t_0, t']\big]
= \int_{t_0}^{t'} \mu(C(\tau) - C(t_0)) \, d\tau.$$



The reason for phrasing the definition in the above more cumbersome
way is that we want to guarantee that an adversary knowing the full
present state of the system and memorizing its whole history cannot
reliably predict when the timeout will expire.^6

We remark that these definitions allow for different time- 
outs to be driven by the same clock, implying that an 
adversary may derive some information on the state of a 
randomized timeout before it expires from the node's behav- 
ior, even if it cannot directly access the values of the clock 
driving the timeout. This is crucial for implementability, as 
it might be very difficult to guarantee that the behavior of a 
dedicated clock that drives a randomized timeout is indeed 
independent of the execution of the algorithm. 

Memory Flags 

Besides timeout and randomized timeout ports, another kind of node
$i$'s local ports are memory flags. For each state
$s \in \mathbb{S}$ and each node $j \in V$, $\mathrm{Mem}_{i,j,s}$
is a local port of node $i$. It is used to memorize whether node $i$
has observed node $j$ in state $s$ since the last reset of the flag.
We say that node $i$ memorizes node $j$ in state $s$ at time $t$ if
$\mathrm{Mem}_{i,j,s}(t) = 1$. Formally, we require that signal
$\mathrm{Mem}_{i,j,s}$ switches to $1$ at time $t$ iff node $i$
observes node $j$ in state $s$ at time $t$ and
$\mathrm{Mem}_{i,j,s}$ is not already in state $1$. The times $t$
when $\mathrm{Mem}_{i,j,s}$ is reset, i.e.,
$(0,t) \in \mathrm{Mem}_{i,j,s}$, are specified by node $i$'s state
machine, which is introduced next.
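The behaviour of a single memory flag can be sketched in Python as follows. The class `MemoryFlag` and its method names are hypothetical, and time is abstracted into a sequence of discrete observation steps, which is a simplification of the continuous-time model.

```python
class MemoryFlag:
    """Sketch of one memory flag: remembers an observed state until reset."""

    def __init__(self, watched_state):
        self.watched_state = watched_state  # the state s being watched for
        self.value = 0                      # Boolean flag state

    def observe(self, observed_state):
        """Feed the currently observed state of the remote node.

        Returns True iff the flag switches to 1 at this step, i.e., the
        watched state is observed and the flag is not already 1.
        """
        if observed_state == self.watched_state and self.value == 0:
            self.value = 1
            return True
        return False

    def reset(self):
        """Reset the flag; in the model, this is triggered by the state machine."""
        self.value = 0
```

A flag set once stays at 1 even after the remote node has left the watched state, which is exactly what allows a node to count how many distinct nodes it has seen in some state since the last reset.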

State Machine 

It remains to specify how nodes switch states and when they reset
memory flags. We do this by means of state machines that may attain
states from the finite alphabet $\mathbb{S}$. A node's state machine
is specified by (i) the set $\mathbb{S}$, (ii) a function $tr$,
called the transition function, from
$\mathcal{T} \subseteq \mathbb{S}^2$ to the set of Boolean
predicates on the alphabet consisting of expressions "$p = s$" (used
for expressing guards), where $p$ is from the node's input and local
ports and $s$ is from the set of possible states of signal $p$, and
(iii) a function $re$, called the reset function, from $\mathcal{T}$
to the power set of the node's memory flags.

^6 This is a non-trivial property. For instance, nodes could just
determine, by drawing from the desired random distribution at time
$t_0$, at which local clock value the timeout shall expire next.
This would, however, essentially reveal prematurely when the timeout
will expire, greatly reducing the power of randomization!

Intuitively, the transition function specifies the conditions
(guards) under which a node switches states, and the reset function
determines which memory flags to reset upon the state change.
Formally, let $P$ be a predicate on node $i$'s input and local
ports. We define "$P$ holds at time $t$" by structural induction: If
$P$ is equal to $p = s$, where $p$ is one of node $i$'s input and
local ports and $s$ is one of the states signal $p$ can attain, then
$P$ holds at time $t$ iff $p(t) = s$. Otherwise, if $P$ is of the
form $\neg P_1$, $P_1 \wedge P_2$, or $P_1 \vee P_2$, we define "$P$
holds at time $t$" in the straightforward manner.
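The structural induction used to evaluate guards can be written down directly as a recursive function. In the following Python sketch, the tuple encoding of predicates, the function name `holds`, and the port names in the example are all hypothetical choices made for illustration.

```python
def holds(predicate, port_states):
    """Evaluate a guard at a fixed time t.

    predicate:   nested tuple, one of
                 ("eq", port, state), ("not", P),
                 ("and", P1, P2), ("or", P1, P2)
    port_states: dict mapping each port name to its state at time t
    """
    operator = predicate[0]
    if operator == "eq":            # base case: p = s
        _, port, state = predicate
        return port_states[port] == state
    if operator == "not":
        return not holds(predicate[1], port_states)
    if operator == "and":
        return holds(predicate[1], port_states) and holds(predicate[2], port_states)
    if operator == "or":
        return holds(predicate[1], port_states) or holds(predicate[2], port_states)
    raise ValueError("unknown operator: %r" % (operator,))


# Hypothetical guard: a timeout has expired and a memory flag is not set.
guard = ("and",
         ("eq", "Time_T1_accept", 1),
         ("not", ("eq", "Mem_2_propose", 1)))
```

Evaluating `guard` against a snapshot of the port states at time $t$ corresponds to checking whether $tr(s, s')$ holds at time $t$ for the transition the guard belongs to.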

We say node $i$ follows its state machine during $[t^-,t^+]$ iff the
following holds: Assume node $i$ observes itself in state
$s \in \mathbb{S}$ at time $t \in [t^-,t^+]$, i.e.,
$S_{i,i}(t) = s$. Then, for each $(s, s') \in \mathcal{T}$, both:

1) Node $i$ switches to state $s'$ at time $t$ iff $tr(s, s')$
holds at time $t$ and $i$ is not already in state $s'$.^7 (In case
more than one guard $tr(s, s')$ can be true at the same time, we
assume that an arbitrary tie-breaking ordering exists among the
transition guards that specifies to which state to switch.)

2) Node $i$ resets memory flag $m$ at some time in the interval
$[t, \tau_{i,i}(t)]$ iff $m \in re(s, s')$ and $i$ switches from
state $s$ to state $s'$ at time $t$. This correspondence is
one-to-one.

A node is defined to be non-faulty during $[t^-,t^+]$ iff during
$[t^-,t^+]$ all its timeouts and randomized timeouts are correct and
it follows its state machine. If it employs multiple state machines
(see below), it needs to follow all of them.

In contrast, a faulty node may change states arbitrarily. Note that
while a faulty node may be forced to send consistent output state
signals to all other nodes if its channels remain correct, there is
no way to guarantee that this still holds true if channels are
faulty.^8

Metastability 

While the presented model does not fully capture propagation and
decay of metastable upsets, i.e., the propagation of intermediate
values through combinational circuit elements, and the probability
distributions on the decay of metastable upsets, it allows us to
capture their generation. An algorithm is inherently susceptible to
metastability because state machines cannot take on new states
instantaneously: Node $i$ decides on state transitions based on the
delayed status of port $S_{i,i}$ instead of its "true" current state
$S_i$. Consider the following example: Node $i$ is in state $s$ at
some time $t$, but since it switched to $s$ only very recently, it
still observes itself in state $s' \ne s$ at time $t$. A metastable
upset might occur at time $t$ (i) if the guard $tr(s', s)$ falls
back to false at time $t$, or (ii) if there is another transition
$(s', s'')$ in $\mathcal{T}$ whose guard becomes true at time $t$.
The treatment of scenario (i) is postponed to Section VII, where it
is discussed together with the implementation of a node's
components. Scenario (ii) is accounted for in the following
definition:

^7 Recall that a node may still observe itself in state $s$ albeit
already having switched to $s'$.

^8 A single physical fault may cause this behavior, as at some point
a node's output port must be connected to remote nodes' input ports.
Even if one places bifurcations at different physical locations
striving to mitigate this effect, if the voltage at the output port
drops below specifications, the values of corresponding input
channels may deviate in unpredictable ways.

Definition 2.1 (Metastability-Freedom): We denote state machine $M$
of node $i$ as being metastability-free during $[t^-,t^+]$ iff for
each time $t \in [t^-,t^+]$ when $M$ switches from some state $s$ to
another state $s'$, it holds that $\tau_{i,i}(t) < t'$, where $t'$
is the infimum of all times in $(t, t^+]$ when $M$ switches to some
state $s''$.
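Given a finite list of switching times of a state machine and the delay function of the node's self-channel, the condition of Definition 2.1 can be checked as in the following Python sketch. The function name `is_metastability_free` and the constant-delay example are our assumptions; the sketch treats the last switch as having no successor, matching the convention that the infimum of an empty set is infinite.

```python
def is_metastability_free(switch_times, tau_ii):
    """Check tau_ii(t) < t' for each switch time t and its successor t'.

    switch_times: times at which the state machine switches states
    tau_ii:       delay function of the channel from node i to itself,
                  mapping a switch time to the time it is observed
    """
    ordered = sorted(switch_times)
    return all(tau_ii(t) < t_next
               for t, t_next in zip(ordered, ordered[1:]))


# Hypothetical self-channel with a constant delay of 0.3 time units.
self_delay = lambda t: t + 0.3
```

Intuitively, the check demands that a node has observed its own latest state change before it changes state again, so that no transition decision is based on a self-observation that is about to become stale.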



Multiple State Machines 

In some situations the previous definitions are too stringent, as
there might be different "components" of a node's state machine that
act concurrently and independently, mostly relying on signals from
disjoint input ports or orthogonal components of a signal. We model
this by permitting that nodes run several state machines in
parallel. All these state machines share the input and local ports
of the respective node and are required to have disjoint state
spaces. If node $i$ runs state machines $M_1, \ldots, M_k$, node
$i$'s output signal is the product of the output signals of the
individual machines. Formally we define: Each of the state machines
$M_j$, $1 \le j \le k$, has an additional own output port $s_j$. The
state of node $i$'s output port $S_i$ at any time $t$ is given by
$S_i(t) := (s_1(t), \ldots, s_k(t))$, where the signals of ports
$s_1, \ldots, s_k$ are defined analogously to the signals of the
output ports of state machines in the single state machine case.
Note that by this definition, the only (local) means for node $i$'s
state machines to interact with each other is by reading the delayed
state signal $S_{i,i}$.

We say that node $i$'s state machine $M_j$ is in state $s$ at time
$t$ iff $s_j(t) = s$, where $S_i(t) = (s_1(t), \ldots, s_k(t))$, and
that node $i$'s state machine $M_j$ switches to state $s$ at time
$t$ iff signal $s_j$ switches to $s$ at time $t$. Since the state
spaces of the machines $M_j$ are disjoint, we will omit the phrase
"state machine $M_j$" from the notation, i.e., we write "node $i$ is
in state $s$" or "node $i$ switched to state $s$", respectively.

Recall that the various state machines of node $i$ are as loosely
coupled as remote nodes, namely via the delayed status signal on
channel $S_{i,i}$ only. Therefore, it makes sense to consider them
independently also when it comes to metastability.

Definition 2.2 (Metastability-Freedom, Multiple SMs): We denote
state machine $M$ of node $i \in V$ as metastability-free during
$[t^-,t^+]$ iff for each time $t \in [t^-,t^+]$ when $M$ switches
from some state $s \in \mathbb{S}$ to another state
$s' \in \mathbb{S}$, it holds that $\tau_{i,i}(t) < t'$, where $t'$
is the infimum of all times in $(t, t^+]$ when $M$ switches to some
state $s'' \in \mathbb{S}$.

Note that by this definition the different state machines may switch
states concurrently without suffering from metastability.^9 It is
even possible that some state machine suffers metastability, while
another is not affected by this at all.^10

Problem Statement 

The purpose of the pulse synchronization protocol is 
that nodes generate synchronized, well-separated pulses by 
switching to a distinguished state accept. Self-stabilization 
requires that it starts to do so within a bounded time, 
for any possible initial state. However, as our protocol 
makes use of randomization, there are executions where 
this does not happen at all; instead, we will show that the 
protocol stabilizes with probability one in finite time. To 
give a precise meaning to this statement, we need to define 
appropriate probability spaces. 

Definition 2.3 (Adversarial Spaces): Denote for $i \in V$ by
$C_i = (C_{i,1}, \ldots, C_{i,c_i})$ the tuple of clocks of node
$i$. An adversarial space is a probabilistic space that is defined
by subsets of nodes and channels $W \subseteq V$ and
$E \subseteq V^2$, a time interval $[t^-,t^+]$, a protocol
$\mathcal{P}$ (nodes' ports, state machines, etc.) as previously
defined, the tuple of all clocks $C = (C_1, \ldots, C_n)$, a
function $\Theta$ assigning each $(i,j) \in V^2$ a delay
$\tau_{i,j} : \mathbb{R}_0^+ \to \mathbb{R}_0^+$, an initial state
$\mathcal{E}_0$ of all ports, and an adversarial function
$\mathcal{A}$. Here $\mathcal{A}$ is a function that maps a partial
execution $\mathcal{E}|_{[0,t]}$ until time $t$ (i.e., all ports'
values until time $t$), $W$, $E$, $[t^-,t^+]$, $\mathcal{P}$, $C$,
and $\Theta$ to the states of all faulty ports during the time
interval $(t, t']$, where $t'$ is the infimum of all times greater
than $t$ when a non-faulty node or channel switches states.

The adversarial space
$\mathcal{AS}(W, E, [t^-,t^+], \mathcal{P}, C, \Theta,
\mathcal{E}_0, \mathcal{A})$ is now defined on the set of all
executions $\mathcal{E}$ satisfying that (i) the initial state of
all ports is given by $\mathcal{E}|_{[0,0]} = \mathcal{E}_0$,
(ii) for all $i \in V$ and $k \in \{1, \ldots, c_i\}$:
$C_{i,k}^{\mathcal{E}} = C_{i,k}$, (iii) for all $(i,j) \in V^2$,
$\tau_{i,j}^{\mathcal{E}} = \tau_{i,j}$, (iv) nodes in $W$ are
non-faulty during $[t^-,t^+]$ with respect to the protocol
$\mathcal{P}$, (v) all channels in $E$ are correct during
$[t^-,t^+]$, and (vi) given $\mathcal{E}|_{[0,t]}$ for any time $t$,
$\mathcal{E}|_{(t,t']}$ is given by $\mathcal{A}$, where $t'$ is the
infimum of times greater than $t$ when a non-faulty node switches
states. Thus, except for when randomized timeouts expire,
$\mathcal{E}$ is fully predetermined by the parameters of
$\mathcal{AS}$.^11

^9 However, care has to be taken when implementing the inter-node
communication of the state components in a metastability-free
manner, cf. Section VII.

^10 This is crucial for the algorithm we are going to present. For
stabilization purposes, nodes comprise a state machine that is prone
to metastability. However, the state machine generating pulses
(i.e., having the state accept, cf. Definition 2.4) does not take
its output signal into account once stabilization is achieved. Thus,
the algorithm is metastability-free after stabilization in the sense
that we guarantee a metastability-free signal indicating when pulses
occur.

^11 This follows by induction starting from the initial
configuration $\mathcal{E}_0$. Using $\mathcal{A}$, we can always
extend $\mathcal{E}$ to the next time when a correct node switches
states, and when non-faulty nodes switch states is fully determined
by the parameters of $\mathcal{AS}$ except for when randomized
timeouts expire. Note that the induction reaches any finite time
within a finite number of steps, as signals switch states finitely
often in finite time.



The probability measure on $\mathcal{AS}$ is induced by the random
distributions of the randomized timeouts specified by $\mathcal{P}$.

To avoid confusion, observe that if the clock functions and delays
do not follow the model constraints during $[t^-,t^+]$, the
respective adversarial space is empty and thus of no concern. This
cumbersome definition provides the means to formalize a notion of
stabilization that accounts for worst-case drifts and delays and an
adversary that knows the full state of the system up to the current
time.

We are now in the position to formally state the pulse
synchronization problem in our framework. Intuitively, the goal is
that after transient faults cease, nodes should with probability one
eventually start to issue well-separated, synchronized pulses by
switching to a dedicated state accept. Thus, as the initial state of
the system is arbitrary, specifying an algorithm^12 is equivalent to
defining the state machines that run at each node, one of which has
a state accept.

Definition 2.4 (Self-Stabilizing Pulse Synchronization): 
Given a set of nodes $W \subseteq V$ and a set $E \subseteq V \times V$ of 
channels, we say that protocol $\mathcal{P}$ is a $(W, E)$-stabilizing 
pulse synchronization protocol with skew $\Sigma$ and accuracy 
bounds $T^- > \Sigma$ and $T^+$ that stabilizes within time $T$ 
with probability $p$ iff the following holds. Choose any time 
interval $[t^-, t^+] \supseteq [t^-, t^- + T + \Sigma]$ and any adversarial space 
$\mathcal{AS}(W, E, [t^-, t^+], \mathcal{P}, \cdot, \cdot, \cdot, \cdot)$ (i.e., $\mathcal{C}$, $\Theta$, $\mathcal{E}_0$, and $\mathcal{A}$ are 
arbitrary). Then executions from $\mathcal{AS}$ satisfy with probability 
at least $p$ that there exists a time $t_s \in [t^-, t^- + T]$ so that, 
denoting by $t_i(k)$ the time when node $i \in W$ switches 
to a distinguished state accept for the $k$-th time after $t_s$ 
($t_i(k) = \infty$ if no such time exists), 
(i) $t_i(1) \in (t_s, t_s + \Sigma)$, 
(ii) $|t_i(k) - t_j(k)| \le \Sigma$ if $\max\{t_i(k), t_j(k)\} \le t^+$, and 
(iii) $T^- \le |t_i(k+1) - t_i(k)| \le T^+$ if $t_i(k) + T^+ \le t^+$. 
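The three conditions of Definition 2.4 can be checked mechanically on a recorded trace. The following Python sketch is only an illustration under stated assumptions: the function name `satisfies_pulse_sync`, the list-of-lists trace layout, and the parameter names `sigma`, `acc_min`, and `acc_max` (standing for the skew and the two accuracy bounds) are hypothetical and not part of the paper.

```python
import math


def satisfies_pulse_sync(pulse_times, t_s, t_plus, sigma, acc_min, acc_max):
    """Check conditions (i)-(iii) of Definition 2.4 on a recorded trace.

    pulse_times[i][k-1] is t_i(k), the k-th time after t_s at which node i
    switches to accept. Missing pulses count as infinity, as in the definition.
    """
    def t(i, k):
        seq = pulse_times[i]
        return seq[k - 1] if k <= len(seq) else math.inf

    nodes = range(len(pulse_times))
    max_k = max(len(seq) for seq in pulse_times)

    # (i) every node issues its first pulse within (t_s, t_s + sigma)
    if not all(t_s < t(i, 1) < t_s + sigma for i in nodes):
        return False

    for k in range(1, max_k + 1):
        for i in nodes:
            # (ii) skew: k-th pulses are at most sigma apart (while <= t_plus)
            for j in nodes:
                if max(t(i, k), t(j, k)) <= t_plus and abs(t(i, k) - t(j, k)) > sigma:
                    return False
            # (iii) accuracy: consecutive pulses are separated by
            # at least acc_min and at most acc_max (while still required)
            if t(i, k) + acc_max <= t_plus:
                gap = abs(t(i, k + 1) - t(i, k))
                if not (acc_min <= gap <= acc_max):
                    return False
    return True
```

For example, two nodes pulsing roughly every 10 time units with at most 0.5 units of skew pass the check for `sigma=1.0`, `acc_min=9.0`, `acc_max=11.0`, whereas a node that skips a required pulse fails condition (iii).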

Note that the fact that $\mathcal{A}$ is a deterministic function and, 
more generally, that we consider each space $\mathcal{AS}$ individually, 
is no restriction: As $\mathcal{P}$ succeeds for any adversarial space 
with probability at least $p$ in achieving stabilization, the same 
holds true for randomized adversarial strategies $\mathcal{A}$ and worst- 
case drifts and delays. 

III. The FATAL Pulse Synchronization Protocol 

In this section, we present our self-stabilizing pulse gener- 
ation algorithm. In order to be suitable for implementation 
in hardware, it needs to utilize very simple rules only. It 
is stated in terms of state machines as introduced in the 
previous section. 

Since the ultimate goal of the pulse generation algorithm 
is to interact with an application layer, we introduce a 
possibility for a coupling with such a layer in the pulse 
generation algorithm itself: for each node $i$, we add a further 
port $\mathrm{NEXT}_i$, which can be driven by node $i$'s application 
layer. As for other state signals, its output raises flag 
$\mathrm{Mem}_{i,\mathrm{NEXT}}$, to which for simplicity we refer as $\mathrm{NEXT}_i$ as 
well. The purpose of the port is to allow the application layer 
to influence the time between two of node $i$'s successively 
generated pulses within a range that does not prevent the 
pulse generation algorithm from stabilizing correctly. 

Footnote: We use the terms "algorithm" and "protocol" interchangeably 
throughout this work. 

In Section VI we give an example for an application 
layer: The quick cycle completing FATAL+ is a non-self- 
stabilizing clock synchronization routine which relies on the 
pulse generation algorithm for self-stabilization. Since we 
will show that the pulse algorithm stabilizes independently 
of the behavior of the NEXT signal, and the clock synchronization 
routine presented in Section VI is designed such that 
it will stabilize once the pulse generation algorithm has done so, 
we can partition the analysis of the compound algorithm 
into two parts. When proving the correctness of the pulse 
generation algorithm in Section IV, we thus assume that for 
each node $i$, $\mathrm{NEXT}_i$ is arbitrary. 

A. Basic Cycle 

The full pulse generation algorithm makes use of a rather 
involved interplay between conditions on timeouts, states, 
and thresholds to converge to a safe state despite a limited 
number of faulty components. As our approach is thus 
complicated to present in bulk, we break it down into pieces. 
Moreover, to facilitate giving intuition about the key ideas 
of the algorithm, in this subsection we assume that there are 
never more than $f < n/3$ faulty nodes, i.e., the remaining 
$n - f$ nodes are non-faulty within $[0, \infty)$. We further assume 
that channels between non-faulty nodes (including loopback 
channels) are correct within $[0, \infty)$. We start by presenting 
the basic cycle that is repeated every pulse once a safe 
configuration is reached (see Figure 1). 



We employ graphical representations of the state machine 
of each node $i \in V$. States are represented by circles con- 
taining their names, while a transition $(s, s') \in T$ is depicted 
as an arrow from $s$ to $s'$. The guard $tr(s, s')$ is written 
as a label next to the arrow, and the reset function's value 
$re(s, s')$ is depicted in a rectangular box on the arrow. To 
keep labels simple we make use of some abbreviations. 
Recall that in the notation of timeouts $(T, s, C)$ the driving 
clock $C$ is omitted. We write $T$ instead of $(T, s)$ if $s$ is the 
same state which node $i$ leaves if the condition involving 
$(T, s)$ is satisfied. Threshold conditions like "$\ge f + 1$ $s$", 
where $s \in S$, abbreviate Boolean predicates that reach over 
all of node $i$'s memory flags $\mathrm{Mem}_{i,j,s}$, where $j \in V$, 
and are defined in a straightforward manner. If in such an 
expression we connect two states by "or", e.g., "$\ge n - f$ 
$s$ or $s'$" for $s, s' \in S$, the summation considers flags of 
both types $s$ and $s'$. Thus, an expression such as "$\ge f + 1$ 
$s$ or $s'$" is equivalent to 
$\sum_{j \in V} \max\{\mathrm{Mem}_{i,j,s}, \mathrm{Mem}_{i,j,s'}\} \ge f + 1$. For any state 
$s \in S$, the condition $S_{i,i} = s$ (respectively, $\neg(S_{i,i} = s)$) is 
written in short as "in $s$" (respectively, "not in $s$"). We write 
"true" instead of a condition that is always true (like e.g. "(in 
$s$) or (not in $s$)" for an arbitrary state $s \in S$). Finally, $re(\cdot,\cdot)$ 
always requires to reset all memory flags of certain types, 
hence we write e.g. propose if all flags $\mathrm{Mem}_{i,j,\mathrm{propose}}$ are to 
be reset. 
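The threshold abbreviation above amounts to counting, over all nodes $j$, whether at least one of the listed flags is set. A minimal Python sketch of such a guard follows; the name `threshold_met` and the representation of node $i$'s memory flags as a dictionary of sets are assumptions made for illustration only.

```python
def threshold_met(mem, states, threshold):
    """Evaluate a guard of the form ">= threshold s or s' ...".

    mem[j] is the set of signal states that node i currently memorizes
    for node j, i.e. Mem_{i,j,s} = 1 iff s is in mem[j].
    Each node j contributes max over the listed states, so it counts once.
    """
    count = sum(1 for flags in mem.values() if any(s in flags for s in states))
    return count >= threshold


# Hypothetical flags of one node in a system with n = 4, f = 1.
mem_i = {0: {"propose"}, 1: {"accept"}, 2: set(), 3: {"propose", "accept"}}
print(threshold_met(mem_i, ("propose",), 2))           # ">= f + 1 propose"
print(threshold_met(mem_i, ("propose", "accept"), 3))  # ">= n - f propose or accept"
```

Node 3 has both flags set but is counted only once, which mirrors the use of the maximum in the sum.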

We now briefly introduce the basic flow of the algorithm 
once it stabilizes, i.e., once all $n - f$ non-faulty nodes are 
well-synchronized. Recall that the remaining up to $f < n/3$ 
faulty nodes may produce arbitrary signals on their outgoing 
channels. A pulse is locally triggered by switching to state 
accept. Thus, assume that at some time all non-faulty nodes 
switch to state accept within a time window of $2d$, i.e., 
pulses are generated by non-faulty nodes within a time 
interval of size $2d$. Supposing that $T_1 \ge 3\vartheta d$, these nodes 
will observe, and thus memorize, each other and themselves 
in state accept within a time interval of size $3d$ and thus 
before $T_1$ expires at any non-faulty node. This makes 
timeout $T_1$ the critical condition for switching to state sleep. 
From state sleep, they will switch to states sleep → waking, 
waking, and finally ready, where the timeout $(T_2, \mathrm{accept})$ 
determines the time this takes, as it is considerably larger 
than $\vartheta(2\vartheta + 2)T_1$. The intermediate states serve the purpose 
of achieving stabilization, hence we leave them out for the 
moment. 

Note that upon switching to state ready, nodes reset 
their propose flags and $\mathrm{NEXT}_i$. Thus, they essentially ignore 
these signals between the most recent time they switched 
to propose before switching to accept and the subsequent 
time when they switch to ready. Since nodes already reset 
their accept flags upon switching to waking, this ensures that 
nodes do not take into account outdated information for the 
decision when to switch to state propose. 

Hence, it is guaranteed that the first node switching from 
state ready to state propose again does so because $T_4$ expired 
or because $T_3$ expired and its NEXT memory flag is true. 
The constraint $\min\{T_3, T_4\} \ge \vartheta(T_2 + 4d)$ ensures that all 
non-faulty nodes observe themselves in state ready before 
the first one switches to propose. Hence, no node deletes 
information about nodes that switch to propose again after 
the previous pulse. 

The first non-faulty node that switches to state accept 
again cannot do so before it memorizes at least $n - f$ nodes 
in state propose, as the accept flags have been reset upon 
switching to state waking. Therefore, at this time at least 
$n - 2f \ge f + 1$ non-faulty nodes are in state propose. Hence, 
the rule that nodes switch to propose if they memorize $f + 1$ 
nodes in states propose will take effect, i.e., the remaining 
non-faulty nodes in state ready switch to propose after less 
than $d$ time. Another $d$ time later all non-faulty nodes in 
state propose will have become aware of this and switch to 
state accept as well, as the threshold of $n - f$ nodes in states 
propose or accept is reached. Thus the cycle is complete and 
the reasoning can be repeated inductively. 
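The two guards that drive the second half of the basic cycle can be written down compactly. The Python sketch below is a simplification for illustration, not the paper's state machine: it covers only the ready to propose and propose to accept transitions described above, and the function name and argument layout are hypothetical.

```python
def next_state_basic_cycle(state, n, f, propose_seen, accept_seen,
                           t3_expired, t4_expired, next_flag):
    """Return the successor state for the ready/propose part of the basic cycle.

    propose_seen / accept_seen are the sets of node ids whose propose / accept
    flags are currently memorized by this node.
    """
    if state == "ready":
        # T4 expired, or T3 expired and NEXT is set, or f + 1 nodes propose.
        if t4_expired or (t3_expired and next_flag) or len(propose_seen) >= f + 1:
            return "propose"
    elif state == "propose":
        # n - f nodes memorized in propose or accept (each node counted once).
        if len(propose_seen | accept_seen) >= n - f:
            return "accept"
    return state


# With f < n/3 we have n - 2f >= f + 1, so the first acceptor always pulls
# the remaining non-faulty nodes from ready to propose.
assert all(n - 2 * f >= f + 1 for n in range(1, 100) for f in range(n) if 3 * f < n)
```

The final assertion spells out the counting step used in the text: of the $n - f$ memorized proposers, at most $f$ are faulty, which leaves at least $f + 1$ non-faulty ones.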

Clearly, for this line of argumentation to be valid, the 
algorithm could be simpler than stated in Figure 1. We 
already mentioned that the motivation of having three 
intermediate states between accept and ready is to facilitate 
stabilization. Similarly, there is no need to make use of the 
accept flags in the basic cycle at all; in fact, it adversely 
affects the constraints the timeouts need to satisfy for the 
above reasoning to be valid. However, the accept flags are 
much better suited for diagnostic purposes than the propose 
flags, since nodes are expected to switch to accept in a 
small time window and remain in state accept for a small 
period of time only (for all our results, it is sufficient if 
$T_1 = 4\vartheta d$). Moreover, two different timeout conditions for 
switching from ready to propose are unnecessary for correct 
operation of the pulse synchronization routine. As discussed 
before, they are introduced in order to allow for a seamless 
coupling to the application layer. 

Figure 1. Basic cycle of node i once the algorithm has stabilized. 

B. Main Algorithm 

We proceed by describing the main routine of the pulse 
algorithm in full. Alongside the main routine, several other 
state machines run concurrently and provide additional in- 
formation to be used during recovery, as we detail later. 

The main routine is graphically presented in Figure 2. 
Except for the states recover and join and additional resets of 
memory flags, the main routine is identical to the basic cycle. 
The purpose of the two additional states is the following: 
Nodes switch to state recover once they detect that some- 
thing is wrong, that is, non-faulty nodes do not execute the 
basic cycle as outlined in Section III-A. This way, non-faulty 
nodes will not continue to confuse others by sending for 
example state signals propose or accept despite clearly being 
out-of-sync. There are various consistency checks that nodes 
perform during each execution of the basic cycle. The first 
one is that in order to switch from state accept to state sleep, 
non-faulty nodes need to memorize at least $n - f$ nodes in 
state accept. If this does not happen within $4d \le T_1/\vartheta$ time 
after switching to state accept, by the arguments given in 
Section III-A the nodes could not have entered state accept 
within $2d$ of each other. Therefore, something must be wrong 
and it is feasible to switch to state recover. Next, whenever 
a non-faulty node is in state waking, there should be no non- 
faulty nodes in states accept or recover. Considering that the 
node resets its accept and recover flags upon switching to 
waking, it should not memorize $f + 1$ or more nodes in states 
accept or recover at a time when it observes itself in state 
waking. If it does, however, it again switches to state recover. 
Last but not least, during a synchronized execution of the 
basic cycle, no non-faulty node may be in state propose for 
more than a certain amount of time before switching to state 
accept. Therefore, nodes will switch from propose to recover 
when timeout $T_5$ expires. 
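The three consistency checks just described can be summarized as a single predicate. The following Python sketch is an illustrative simplification: the function name, the boolean timeout arguments, and the set-based flag representation are assumptions, and the full transition rules of Figure 2 contain further conditions that are not modeled here.

```python
def should_recover(state, n, f, accept_seen, recover_seen, t1_expired, t5_expired):
    """Return True if one of the three consistency checks of the basic cycle fails.

    accept_seen / recover_seen are the sets of node ids memorized in
    accept / recover by this node.
    """
    # Check 1: T1 ran out in accept without memorizing n - f nodes in accept.
    if state == "accept" and t1_expired and len(accept_seen) < n - f:
        return True
    # Check 2: in waking, f + 1 or more nodes are memorized in accept or recover.
    if state == "waking" and len(accept_seen | recover_seen) >= f + 1:
        return True
    # Check 3: the node stayed in propose until T5 expired.
    if state == "propose" and t5_expired:
        return True
    return False
```

With $n = 4$ and $f = 1$, for example, a node in accept whose $T_1$ has expired after memorizing only two nodes in accept would switch to recover, while one that memorized three would proceed to sleep.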

There are two different ways for nodes in recover to 
switch back to the basic cycle, corresponding to two different 
mechanisms for stabilization. The transition from recover to 
accept requires to (directly) observe $n - f$ nodes in state 
accept. This enables nodes to resynchronize provided that 
at least $n - f$ nodes are already executing the basic cycle 
in synchrony. While this method is easily implemented, 
clearly it is insufficient to ensure stabilization from arbitrary 
initial configurations. Hence, nodes can also join the basic 
cycle again via the second new state, called join. Since 
the Byzantine nodes may "play nice" towards $n - 2f$ or 
more nodes still executing the basic cycle, making them 
believe that system operation continues as usual, it must be 
possible to do so without having a majority of nodes in state 
recover. On the other hand, it is crucial that this happens in 
a sufficiently well-synchronized manner, as otherwise nodes 
could drop out of the basic cycle again because the various 
checks of consistency detect an erroneous execution of the 
basic cycle. 

Figure 2. Overview of the core routine of node i's self-stabilizing pulse algorithm. 

In part, this issue is solved by an additional agreement 
step. In order to enter the basic cycle again, nodes need to 
memorize $n - f$ nodes in states join (the respective nodes 
detected an inconsistency), propose (these nodes continued 
to execute the basic cycle), or accept (there are executions 
where nodes reset their propose flags because of switching 
to join when other nodes already switched to accept). Owing to 
the threshold conditions of $f + 1$ nodes memorized in state 
join or $f + 1$ nodes memorized in state propose for leaving 
state recover, all nodes will follow the first one switching 
from join to propose quickly, just as with the switch from 
propose to accept in an ordinary execution of the basic cycle. 
However, it is decisive that all nodes are in states that permit 
them to participate in this agreement step in order to guarantee 
success of this approach. 

As a result, still a certain degree of synchronization needs 
to be established beforehand, both among nodes that still 
execute the basic cycle and those that do not. For instance, if 
at the point in time when a majority of nodes and channels 
become non-faulty, some nodes already memorize nodes 
in join that are not, they may switch to state join and 
subsequently propose prematurely, causing others to have 
inconsistent memory flags as well. Byzantine faults may 
sustain such an amiss configuration of the system indefinitely. 

Footnote: This is the reason for the complicated transition condition involving 
additional states and timeouts. The detailed interplay between these con- 
ditions is delicate and beyond the scope of a high-level description of the 
algorithm; the interested reader is referred to the analysis section. 

So why did we put so much effort into "shifting" the focus 
to this part of the algorithm? The key advantage is that nodes 
outside the basic cycle may take into account less reliable 
information for stabilization purposes. They may take the 
risk of metastable upsets (as we know it is impossible to 
avoid these during the stabilization process, anyway) and 
make use of randomization. 

In fact, to make the above scheme work, it is sufficient that 
all non-faulty nodes agree on a so-called resynchronization 
point (cf. Definitions 3.1 and 3.2), that is, a point in time 
at which nodes reset the memory flags for states join 
and sleep → waking as well as certain timeouts, while 
guaranteeing that no node is in these states close to the 
respective reset times. Except for state sleep → waking, 
all of these timeouts, memory flags, etc. are not part of the 
basic cycle at all, thus nodes may enforce consistent values 
for them easily when agreeing on such a resynchronization 
point. 

Conveniently, the use of randomization also ensures that 
it is quite unlikely that nodes are in state sleep → waking 
close to a resynchronization point, as the consistency check 
of having to memorize $n - f$ nodes in state accept in order 
to switch to state sleep guarantees that the time windows 
during which non-faulty nodes may switch to sleep make 
up a small fraction of all times only. 

Figure 3. Extension of node i's core routine. 

Consequently, the remaining components of the algorithm 
deal with agreeing on resynchronization points and utilizing 
this information in an appropriate way to ensure stabilization 
of the main routine. We describe this connection to the 
main routine first. It is done by another, quite simple state 
machine, which runs in parallel alongside the core routine. 
It is depicted in Figure 3. 



Its purpose is to reset memory flags in a consistent way 
and to determine when a node is permitted to switch to 
join. In general, a resynchronization point (locally observed 
by switching to state resync, which is introduced later) 
triggers the reset of the join and sleep → waking flags. 
If there are still nodes executing the basic cycle, a node 
may become aware of it by observing $f + 1$ nodes in state 
sleep → waking at some time. In this case it switches 
from the state passive, which it entered at the point in 
time when it locally observed the resynchronization point, 
to the state active. Subsequently, once timeout $T_6$ expires, 
the node will switch to a state in which it is more susceptible 
to switching to state join. This is expressed by the rather 
involved transition rule $tr(\mathrm{recover}, \mathrm{join})$ (in Figure 2). $T_6$ is 
much smaller than $T_7$, but $T_6$ is of no concern until the node 
switches to state active and resets $T_6$. The condition that 
$\mathrm{Mem}_{i,i,\mathrm{join}} = 0$ simply means that nodes should not already 
have attempted to stabilize by switching to join since the 
most recent transition to passive. This avoids interfering too 
much with the second stabilization mechanism (switching 
from recover to accept), as it might take significantly longer 
than the time required for this "immediate" recovery to 
stabilize by means of agreeing on a resynchronization point. 

Footnote: The conditions "in active" and "not in dormant", respectively, here 
ensure that the transition is not performed because the node has been in 
state resync a long time ago, but there was no recent switching to resync. 

It remains to explain how resynchronization points are 
generated. 

C. Resynchronization Algorithm 



The resynchronization routine is specified in Figure 4. 
Similarly to the extension of the core routine, it is a lower 
layer that the core routine uses for stabilization purposes 
only. It provides some synchronization that is akin to that 
of a pulse, except that such "weak pulses" occur at random 
times, and may be generated inconsistently even after the 
algorithm as a whole has stabilized. Since the main routine 
operates independently of the resynchronization routine once 
the system has stabilized, we can afford the weaker guar- 
antees of the routine: If it succeeds in generating a "good" 
resynchronization point merely once, the main routine will 
stabilize deterministically. 

Definition 3.1 (Resynchronization Points): Given $W \subseteq V$, 
time $t$ is a $W$-resynchronization point iff each node in 
$W$ switches to state supp → resync in the time interval 
$(t, t + 2d)$. 

Definition 3.2 (Good Resynchronization Points): 
A $W$-resynchronization point is called good iff no node from 
$W$ switches to state sleep during $[t - \Delta_g, t)$, where 
$\Delta_g := (2\vartheta + 3)T_1$, and no node is in state join during 
$[t - T_1 - d, t + 4d)$. 
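Definitions 3.1 and 3.2 translate directly into a check on recorded switching times. The Python sketch below assumes a hypothetical trace format (dictionaries mapping node ids to lists of switch times, and to lists of half-open intervals spent in join); the function names are illustrative, and the window bounds follow the definitions as reconstructed above.

```python
def is_resync_point(t, supp_resync_times, d):
    """Definition 3.1: every node switches to supp -> resync within (t, t + 2d)."""
    return all(any(t < s < t + 2 * d for s in times)
               for times in supp_resync_times.values())


def is_good_resync_point(t, supp_resync_times, sleep_times, join_intervals,
                         d, theta, T1):
    """Definition 3.2: additionally, no switch to sleep during [t - delta_g, t)
    and no node in join during [t - T1 - d, t + 4d)."""
    if not is_resync_point(t, supp_resync_times, d):
        return False
    delta_g = (2 * theta + 3) * T1
    if any(t - delta_g <= s < t for times in sleep_times.values() for s in times):
        return False
    lo, hi = t - T1 - d, t + 4 * d
    # A half-open stay [a, b) in join overlaps [lo, hi) iff a < hi and lo < b.
    if any(a < hi and lo < b
           for intervals in join_intervals.values() for (a, b) in intervals):
        return False
    return True
```

For instance, with $d = 1$, $\vartheta = 1.25$, and $T_1 = 4$ we get $\Delta_g = 22$, so a switch to sleep at time 80 rules out $t = 100$ as a good resynchronization point, whereas one at time 77 does not.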

In order to clarify that despite having a linear number of 
states (supp 1, ..., supp n), this part of the algorithm can be 
implemented using 2-bit communication channels between 
state machines only, we generalize our description of state 
machines as follows. If a state is depicted as a circle 
separated into an upper and a lower part, the upper part 
denotes the local state, while the lower part indicates the 
signal state to which it is mapped. A node's memory flags 
then store the respective signal states only, i.e., remote nodes 
do not distinguish between states that share the same signal. 
Clearly, such a machine can be simulated by a machine as 
introduced in the model section. The advantage is that such 
a mapping can be used to reduce the number of transmitted 
state bits; for the resynchronization routine given in Figure 4 
we merely need two bits (init/wait and none/supp) instead 
of $\lceil \log(n + 3) \rceil + 1$ bits. 
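To get a feeling for the saving, the bit count of the unmapped encoding can be tabulated against the constant two bits of the mapped one. This short Python sketch assumes the logarithm is to base 2; the function name is hypothetical.

```python
def unmapped_state_bits(n):
    """Bits needed without the signal mapping: ceil(log2(n + 3)) + 1.

    (m - 1).bit_length() equals ceil(log2(m)) for every integer m >= 1,
    which avoids floating-point rounding.
    """
    return (n + 3 - 1).bit_length() + 1


MAPPED_BITS = 2  # init/wait and none/supp

for n in (4, 13, 61, 253):
    print(n, unmapped_state_bits(n), MAPPED_BITS)
```

The unmapped encoding grows logarithmically with $n$, while the mapped signals stay at two bits regardless of the system size.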

The basic idea behind the resynchronization algorithm is 
the following: Every now and then, nodes will try to initiate 
agreement on a resynchronization point. This is the purpose 
of the small state machine on the left in Figure 4. Recalling 
that the transition condition "true" simply means that the 
node switches to state wait again as soon as it observes itself 
in state init, it is easy to see that it does nothing else than 
creating an init signal as soon as $R_3$ expires and resetting 
$R_3$ again as quickly as possible. As the time when a node 
switches to init is determined by the randomized timeout 
$R_3$ distributed over a large interval (cf. Equality (11)) only, 
it is impossible to predict when it will expire, even 
with full knowledge of the execution up to the current point 
in time. Note that the complete independence of this part 
of node $i$'s state from the remaining protocol implies that 
faulty nodes are not able to influence the respective times 
by any means. 

Consider now the state machine displayed on the right of 
Figure 4. To illustrate how the routine is intended to work, 
assume that at the time $t$ when a non-faulty node $i$ switches 
to state init, all non-faulty nodes are not in any of the states 
supp → resync, resync, or supp $i$, and at all non-faulty nodes 
the timeout $(R_2, \mathrm{supp}\ i)$ has expired. Then, no matter what 
the signals from faulty nodes or on faulty channels are, each 
non-faulty node will be in one of the states supp $j$, $j \in V$, 
or supp → resync at time $t + d$. Hence, they will observe 
each other (and themselves) in one of these states at some 
time smaller than $t + 2d$. These statements follow from the 
various timeout conditions of at least $2\vartheta d$ and the fact that 
observing node $i$ in state init will make nodes switch to 
state supp $i$ if in none or supp $j$, $j \ne i$. Hence, all of them 
will switch to state supp → resync during $(t, t + 2d)$, i.e., 
$t$ is a resynchronization point. Since $t$ follows a random 
distribution that is independent of the remaining algorithm 
and, as mentioned earlier, most of the times nodes do not 
switch to state sleep and it is easy to deal with the condition 
on join states, there is a large probability that $t$ is a good 
resynchronization point. Note that timeout $R_1$ makes sure 
that no non-faulty node will switch to supp → resync again 
anytime soon, leaving sufficient time for the main routine to 
stabilize. 

The scenario we just described relies on the fact that 
at time $t$ no node is in state supp → resync or state 
resync. We will choose $R_2 \gg R_1$, implying that $R_2 + 3d$ 
time after a node switched to state init all nodes have 
"forgotten" about this, i.e., $(R_2, \mathrm{supp}\ i)$ is expired and 
they switched back to state none (unless other init signals 
interfered). Thus, in the absence of Byzantine faults, the 
above requirement is easily achieved with a large probability 
by choosing $R_3$ as a uniform distribution over some interval 
$[R_2 + 3d, R_2 + O(nR_1)]$: Other nodes will switch to init $O(n)$ 
times during this interval, each time "blocking" other nodes 
for at most $O(R_1)$ time. If the random choice picks any 
other point in time during this interval, a resynchronization 
point occurs. Even if the clock speed of the clock driving $R_3$ 
is manipulated in a worst-case manner (affecting the density 
of the probability distribution with respect to real time by 
a factor of at most $\vartheta$), we can just increase the size of the 
interval to account for this. 
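The counting argument above boils down to measuring how much of the interval over which $R_3$ is distributed remains free of "blocking" windows. The Python sketch below computes that fraction for a given set of windows; the function name and the example numbers are hypothetical, and the clock-drift factor $\vartheta$ mentioned in the text is deliberately not modeled.

```python
def unblocked_fraction(interval, blocked):
    """Fraction of [lo, hi) not covered by the union of the blocked windows.

    blocked is a list of (start, end) pairs; overlapping windows are
    counted only once, and parts outside the interval are ignored.
    """
    lo, hi = interval
    covered = 0.0
    cursor = lo  # everything left of cursor is already accounted for
    for a, b in sorted(blocked):
        a, b = max(a, cursor), min(b, hi)
        if b > a:
            covered += b - a
            cursor = b
    return 1.0 - covered / (hi - lo)


# Hypothetical numbers: n - 1 = 3 other nodes, each blocking for 10 time
# units, within an interval of length 100.
print(unblocked_fraction((0.0, 100.0), [(10.0, 20.0), (15.0, 30.0), (90.0, 120.0)]))
```

If the interval length grows linearly with the total blocked time, as suggested by the choice of an interval of length $O(nR_1)$, this fraction stays bounded away from zero.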

However, what happens if only some of the nodes receive 
an init signal due to faulty channels or nodes? If the same 
holds for some of the subsequent supp signals, it might 
happen that only a fraction of the nodes reaches the threshold 
for switching to state supp → resync, resulting in an 
inconsistent reset of flags and timeouts across the system. 
Until the respective nodes switch to state none again, they 
will not support a resynchronization point again, i.e., about 
$R_1$ time is "lost". This issue is the reason for the agreement 
step and the timeouts $(R_2, \mathrm{supp}\ j)$. In order for any node 
to switch to state supp → resync, there must be at least 
$n - 2f \ge f + 1$ non-faulty nodes supporting this. Hence, all 
of these nodes recently switched to a state supp $j$ for some 
$j \in V$, resetting $(R_2, \mathrm{supp}\ j)$. Until these timeouts expire, 
$f + 1 \in \Omega(n)$ non-faulty nodes will ignore init signals on 
the respective channels. Since there are $O(n^2)$ channels, it is 
possible to choose $R_2 \in O(nR_1)$ such that this may happen 
at most $O(n)$ times in $O(n)$ time. Playing with constants, 
we can pick $R_3 \in O(n)$ maintaining that still a constant 
fraction of the times are "good" in the sense that $R_3$ expiring 
at a non-faulty node will result in a good resynchronization 
point. 

D. Timeout Constraints 



Condition 3.3 summarizes the constraints we require on 
the timeouts for the core routine and the resynchronization 
algorithm to act and interact as intended. 

Condition 3.3 (Timeout Constraints): Recall that $\vartheta > 1$ 
and $\Delta_g := (2\vartheta + 3)T_1$. Define 

We need to show that this system can always be solved. 
Furthermore, we would like to allow coupling the pulse 
generation algorithm to an application algorithm with any 
possible drift. To this end, we would like to be able to 
make the ratio $(T_2 + T_4)/(\vartheta(T_2 + T_3 + 4d))$ arbitrarily large: 
Here, $T_2 + T_4$ is the minimal gap between successive 
pulses generated at each node, provided that the states of all 
the NEXT signals are constantly zero, and $\vartheta(T_2 + T_3 + 4d)$ 
is the maximal time it takes nodes to observe themselves in 
state ready with $T_3$ expired after the last generated pulse (as 
then they will respond to $\mathrm{NEXT}_i$ switching to one). 



Lemma 3.4: For any $d, \vartheta \in O(1)$, Condition 3.3 can be 
satisfied with $T_1, \ldots, T_7, R_1 \in O(1)$ and $R_2 \in O(n)$, where 
the ratio $(T_2 + T_4)/(\vartheta(T_2 + T_3 + 4d))$ 
may be chosen to be an arbitrarily large constant. 



Proof: First, observe that if Inequality (3) holds, the 
denominator in the right hand side of Inequality (12) is 
positive. Thus, we can equivalently state Inequality (12) as 

$$T_2 \ge \frac{2\vartheta\Delta_g + (1 - \lambda)(\vartheta - 1)T_1 + (4 - \lambda)d}{1 - \lambda}. \qquad (13)$$ 



Since $\lambda \in (4/5, 1)$, this inequality clearly imposes a stronger 
constraint than Inequality (3), hence we can replace Inequal- 
ities (3) and (12) with this one and obtain an equivalent 
system. The requirement of $(T_2 + T_4)/(\vartheta(T_2 + T_3 + 4d)) = \alpha$ 
can be rephrased as 

$$T_4 \ge (\alpha\vartheta - 1)T_2 + \alpha\vartheta(T_3 + 4d). \qquad (14)$$ 

Again, clearly this constraint is stronger than Inequality (5), 
hence we drop Inequality (5) in favor of Inequality (14). 
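The step from the ratio requirement to Inequality (14) is a one-line rearrangement of $T_2 + T_4 \ge \alpha\vartheta(T_2 + T_3 + 4d)$, and it can be sanity-checked numerically. In the Python sketch below the function names and the sample values for $\alpha$, $\vartheta$, $T_2$, $T_3$, and $d$ are hypothetical and chosen only for illustration; they are not values from the paper.

```python
def t4_lower_bound(alpha, theta, T2, T3, d):
    """Right hand side of Inequality (14)."""
    return (alpha * theta - 1) * T2 + alpha * theta * (T3 + 4 * d)


def pulse_ratio(theta, T2, T3, T4, d):
    """The ratio (T2 + T4) / (theta * (T2 + T3 + 4d)) from Lemma 3.4."""
    return (T2 + T4) / (theta * (T2 + T3 + 4 * d))


alpha, theta, T2, T3, d = 3.0, 1.25, 100.0, 40.0, 1.0
T4 = t4_lower_bound(alpha, theta, T2, T3, d)
print(T4, pulse_ratio(theta, T2, T3, T4, d))  # choosing T4 at the bound gives ratio alpha
```

Choosing $T_4$ exactly at the bound yields a ratio of exactly $\alpha$, and any larger $T_4$ increases the ratio further.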



We satisfy the inequalities by iteratively defining the val- 
ues of the left hand sides in accordance with the respective 
constraint, in the order ([g), ([13]), 0, ([T4|, (|6|, (|9|, 
and finally (10). Note that this is feasible, as in any step the 
right hand side of the current inequality is an expression in 
$d$, $\vartheta$, $\alpha$, and, in case of Inequality (10), $n - f$. We obtain 



Footnote: For simplicity, we refrain from demanding equality and drop terms in 
order to get more condensed expressions. For $\vartheta < 1.2$, for example, the 
increase in the bounds is not significant. 



the solution 



T2 ■■= 

n ■■= 

T4 ~ 



AM 

1 - A 
Ae^^d 
1 - A 

(^92 _ l)46i?3d 

T^A ^ 
46i93(ai93 - l)d 

1 - A 
46i94(ai93 _ i)d 

1 - A 

— + 78a-d^d 

1 — A 

46i}^{3a^^ - l)d 

r^^A ^ 

(92i?^(3at?3 - 

O^Ap 
(218a??^ + 108i92))(n- /)d 
^ 1- A ■ 

As $\alpha \in O(1)$ was arbitrary, $d$ and $\vartheta$ are constants, and 
$\lambda \in (4/5, 1)$ depends on $\vartheta$ only and is thus a constant as 
well, these values satisfy the asymptotic bounds stated in 
the lemma, concluding the proof. ■ 

IV. Analysis 

In this section we derive skew bounds $\Sigma$, as well as 
accuracy bounds $T^-, T^+$, such that the presented protocol 
is a $(W, E)$-stabilizing pulse synchronization protocol, for 
proper choices of the set of nodes $W$ and the set of channels 
$E$, with skew $\Sigma$ and accuracy bounds $T^-, T^+$ that stabilizes 
within time $T(k) \in O(kn)$ with probability $1 - 2^{-k(n-f)}$, 
for any $k \in \mathbb{N}$. This analysis follows the lines of [30], 
with minor adjustments due to the changes made to the 
FATAL protocol. Moreover, we show that if a set of at least 
$n - f$ nodes fires pulses regularly, then other non-faulty 
nodes synchronize within $O(R_1)$ time deterministically. This 
stabilization mechanism is much simpler; the main challenge 
here is to avoid interference with the other approach. 

A. Basic Statements 

To start our analysis, we need to define the basic require- 
ments for stabilization. Essentially, we need that a majority 
of nodes is non-faulty and the channels between them are 
correct. However, the first part of the stabilization process 
is simply that nodes "forget" about past events that are 
captured by their timeouts. Therefore, we demand that these 
nodes indeed have been non-faulty for a time period that 
is sufficiently large to ensure that all timeouts have been 
reset at least once after the considered set of nodes became 
non-faulty. 



Definition 4.1 (Coherent Nodes): A subset of nodes 
$W \subseteq V$ is called coherent during the time interval $[t^-, t^+]$ 
iff during $[t^- - (\vartheta(R_2 + 3d) + 8(1 - \lambda)R_2) - d, t^+]$ all 
nodes $i \in W$ are non-faulty, and all channels $S_{i,j}$, $i, j \in W$, 
are correct. 

We will show that if a coherent set of at least n — f nodes 
fires a pulse, i.e., switches to accept in a tight synchrony, this 
set will generate pulses deterministically and with controlled 
frequency, as long as the set remains coherent. This motivates 
the following definitions. 

Definition 4.2 (Stabilization Points): We call time $t$ a $W$- 
stabilization point (quasi-stabilization point) iff all nodes 
$i \in W$ switch to accept during $[t, t + 2d)$ ($[t, t + 3d)$). 
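Definition 4.2 differs from a quasi-stabilization point only in the width of the window. A small Python sketch of the corresponding check follows; the function name, the `quasi` flag, and the dictionary-based trace format are assumptions made for illustration.

```python
def is_stabilization_point(t, accept_times, d, quasi=False):
    """Definition 4.2: all nodes in W switch to accept during [t, t + 2d),
    or during [t, t + 3d) for a quasi-stabilization point.

    accept_times maps each node in W to the list of times at which it
    switched to accept.
    """
    width = 3 * d if quasi else 2 * d
    return all(any(t <= s < t + width for s in times)
               for times in accept_times.values())


# Hypothetical trace with d = 1: node 2 is too late for a stabilization
# point at t = 10, but still within the wider quasi-stabilization window.
trace = {0: [10.0], 1: [11.5], 2: [12.4]}
print(is_stabilization_point(10.0, trace, 1.0))
print(is_stabilization_point(10.0, trace, 1.0, quasi=True))
```

Every stabilization point is also a quasi-stabilization point, since the wider window contains the narrower one.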

Throughout this section, we assume the set of coherent 
nodes $W$ with $|W| \ge n - f$ to be fixed and consider all nodes 
in and channels originating from $V \setminus W$ as (potentially) 
faulty. As all our statements refer to nodes in W, we will 
typically omit the word "non-faulty" when referring to the 
behavior or states of nodes in W, and "all nodes" is short for 
"all nodes in W". Note, however, that we will still clearly 
distinguish between channels originating at faulty and non- 
faulty nodes, respectively, to nodes in W . 

As a first step, we observe that at times when W is 
coherent, indeed all nodes reset their timeouts, basing the 
respective state transition on proper perception of nodes 
in W. 

Lemma 4.3: If $W$ is coherent during the time interval 
$[t^-, t^+]$, with $t^- \ge \vartheta(R_2 + 3d) + 8(1 - \lambda)R_2 + d$, any 
(randomized) timeout $(T, s)$ of any node $i \in W$ expiring at 
a time $t \in [t^-, t^+]$ has been reset at least once since time 
$t^- - (\vartheta(R_2 + 3d) + 8(1 - \lambda)R_2)$. If $t'$ denotes the time 
when such a reset occurred, for any $j \in W$ it holds that 
$S_{i,j}(t') = S_j(\tau_{j,i}^{-1}(t'))$, i.e., at time $t'$, $i$ observes $j$ in a 
state $j$ attained when it was non-faulty. 



Proof: According to Condition 3.3, the largest possible 
value of any (randomized) timeout is $\vartheta(R_2 + 3d) + 8(1 - \lambda)R_2$. 
Hence, any timeout that is in state 1 at a time smaller 
than $t^- - (\vartheta(R_2 + 3d) + 8(1 - \lambda)R_2) \ge d$ expires before time 
$t^-$ or is reset at least once. As by the definition of coherency 
all nodes in $W$ are non-faulty and all channels between such 
nodes are correct during $[t^- - (\vartheta(R_2 + 3d) + 8(1 - \lambda)R_2), t^+]$, 
this implies the statement of the lemma. ■ 

Phrased informally, any corruption of timeout and channel 
states eventually ceases, as correct timeouts expire and 
correct links remember no events that lie d or more time 
in the past. Proper cleaning of the memory flags is more 
complicated and will be explained further down the road. 

Throughout this section, we will assume for the sake of 
simplicity that the set $W$ is coherent at all times and use 
this lemma implicitly, e.g., we will always assume that nodes 
from $W$ will observe all other nodes from $W$ in states that 
they indeed had less than $d$ time ago, that the expiring of 
randomized timeouts at non-faulty nodes cannot be predicted 
accurately, etc. We will discuss more general settings in Section V. 



We proceed by showing that once all nodes in $W$ switch to 
accept in a short period of time, i.e., a $W$-quasi-stabilization 
point is reached, the algorithm guarantees that synchronized 
pulses are generated deterministically with a frequency that 
is bounded both from above and below. 

Theorem 4.4: Suppose $t$ is a W-quasi-stabilization point.
Then

(i) all nodes in W switch to accept exactly once within
$[t, t + 3d)$, and do not leave accept until $t + 4d$; and

(ii) there will be a W-stabilization point
$t' \in (t + (T_2 + T_3)/\vartheta, t + T_2 + T_4 + 5d)$ satisfying that no node in W
switches to accept in the time interval $[t + 3d, t')$; and
that

(iii) each node $i$'s, $i \in W$, core state machine (Figure 1) is
metastability-free during $[t + 3d, t' + 3d]$.
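The three time windows named in Theorem 4.4 can be evaluated numerically. The following is a minimal sketch, not part of the original algorithm description: the class `Timeouts`, the function names, and the sample values are hypothetical, and the sample values are not checked against Condition 3.3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timeouts:
    """Illustrative parameter set; field names mirror the symbols in the text."""
    theta: float  # clock drift bound (vartheta >= 1)
    d: float      # maximum delay
    T2: float
    T3: float
    T4: float


def accept_window(t: float, p: Timeouts) -> tuple[float, float]:
    """Statement (i): every node in W switches to accept within [t, t + 3d)."""
    return (t, t + 3 * p.d)


def next_stabilization_window(t: float, p: Timeouts) -> tuple[float, float]:
    """Statement (ii): open interval containing the next W-stabilization point t'."""
    return (t + (p.T2 + p.T3) / p.theta, t + p.T2 + p.T4 + 5 * p.d)


def metastability_free_window(t: float, t_next: float, p: Timeouts) -> tuple[float, float]:
    """Statement (iii): closed interval [t + 3d, t' + 3d]."""
    return (t + 3 * p.d, t_next + 3 * p.d)


if __name__ == "__main__":
    # Hypothetical values chosen only to exercise the formulas.
    p = Timeouts(theta=1.25, d=1.0, T2=200.0, T3=50.0, T4=60.0)
    print(accept_window(0.0, p))
    print(next_stabilization_window(0.0, p))
```

The lower end of the second window shrinks with the drift bound, while the upper end does not, which is why the pulse separation is bounded from both sides.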

Proof: Proof of (i): Due to Inequality (2), a node does
not leave the state accept earlier than $T_1/\vartheta \ge 4d$ time after
switching to it. Thus, no node can switch to accept twice
during $[t, t + 3d)$. By definition of a quasi-stabilization point,
every node does switch to accept in the interval
$[t, t + 3d) \subset [t, t + T_1/\vartheta)$. This proves Statement (i).

Proof of (ii): For each $i \in W$, let $t_i \in [t, t + 3d)$ be the
time when $i$ switches to accept. By (i), $t_i$ is well-defined.
Further let $t'_i$ be the infimum of times in $(t_i, \infty)$ when $i$
switches to recover or propose. In the following, denote
by $i \in W$ a node with minimal $t'_i$.

We will show that all nodes switch to propose via states
sleep, sleep → waking, waking, and ready in the presented
order. By (i), nodes do not leave accept before $t + 4d$. Thus at
time $t + 4d$, each node in W is in state accept and observes
each other node in W in accept. Hence, each node in W
memorizes each other node in W in accept at time $t + 4d$. For
each node $j \in W$, let $t_{j,s}$ be the time node $j$'s timeout $T_1$
expires first after $t_j$. Then $t_{j,s} \in (t_j + T_1/\vartheta, t_j + T_1 + d)$.
Since $|W| \ge n - f$, each node $j$ switches to state sleep
at time $t_{j,s}$. Hence, by time $t + T_1 + 4d$, no node will be
observed in state accept anymore (until the time when it
switches to accept again).

When a node $j \in W$ switches to state waking at the
minimal time $t_w$ larger than $t_j$, it does not do so earlier than
at time $t + T_1/\vartheta + (2 + 1/\vartheta)T_1 = t + (2 + 2/\vartheta)T_1 > t + T_1 + 4d$.
This implies that all nodes in W have already left accept
at least d time ago, since they switched to it at their
respective times $t_j < t + T_1 + 3d$. Moreover, they cannot
switch to accept again until $t'_i$ as it is minimal and nodes
need to switch to propose or recover before switching
to accept. Hence, nodes in W are not observed in state
accept during $(t + T_1 + 4d, t'_i]$, in particular not by node $j$.
Furthermore, nodes in W are not observed in state recover
during $(t_w - d, t'_i]$. As it resets its accept and recover flags
upon switching to waking, $j$ will hence neither switch from
waking to recover nor from ready to propose during $(t_w, t'_i)$.

Footnote: Note that we follow the convention that $\inf \emptyset = \infty$ if the
infimum is taken with respect to a (from above) unbounded subset of $\mathbb{R}^+_0$.

Footnote: The upper bound comprises an additive term of d since $T_1$ is
reset at some time from $(t_j, t_j + d)$.

Now consider node $i$. By the previous observation, it will
not switch from waking to recover, but to ready, following
the basic cycle. Consequently, it must wait for timeout $T_2$
to expire, i.e., cannot switch to ready earlier than at time
$t + T_2/\vartheta$. By definition of $t'_i$, node $i$ thus switches to propose
at time $t'_i$. As it is the first node that does so, this cannot
happen before timeouts $T_3$ or $T_4$ expire, i.e., before time

\[ t + \frac{T_2}{\vartheta} + \frac{\min\{T_3, T_4\}}{\vartheta} \overset{(5)}{\ge} t + T_2 + 5d. \tag{15} \]



All other nodes in W will switch to waking, and for the
first time after $t_j$, observe themselves in state waking at
a time within $(t + T_1 + 4d, t + (2\vartheta + 2)T_1 + 7d)$. Recall
that unless they memorize at least $f + 1$ nodes in accept or
recover while being in state waking, they will all switch to
state ready by time

\[ \max\{t + T_2 + 4d,\; t + (2\vartheta + 2)T_1 + 7d\} = t + T_2 + 4d. \tag{16} \]

As we just showed that $t'_i \ge t + T_2 + 5d$, this implies that
at time $t + T_2 + 5d$ all nodes are observed in state ready,
and none of them leaves before time $t'_i$.

Now choose $t'$ to be the infimum of times from
$(t + (T_2 + T_3)/\vartheta, t + T_2 + T_4 + 4d]$ when a node in W switches to state
accept. Because of Inequality (15), node $j$ cannot switch
to propose within $[t_j, t + (T_2 + T_3)/\vartheta)$. Thus, (after time
$t + 3d$) node $j$ does not switch to accept again earlier than
time $t'$, and timeout $T_5$ cannot expire at $j$ until time

\[ t + \frac{T_2 + T_3 + T_5}{\vartheta} \ge t + T_2 + T_4 + 7d \ge t' + 3d, \tag{17} \]

making it impossible for $j$ to switch from propose to recover
at a time within $[t_j, t' + 3d]$. What is more, a node from W
that switches to accept must stay there for at least $T_1/\vartheta > 3d$
time. Thus, by definition of $t'$, no node $j \in W$ can
switch from accept to recover at a time within $[t_j, t' + 3d]$.
Hence, no node $j \in W$ can switch to state recover after
$t_j$, but earlier than time $t' + 2d$. It follows that no node in
W can switch to other states than propose or accept during
$[t + T_2 + 4d, t' + 2d]$. In particular, no node in W resets its
propose flags during $[t + T_2 + 5d, t' + 2d] \supseteq [t'_i, t' + 2d]$.

If at time $t'$ a node in W switches to state accept, $n - 2f \ge f + 1$
of its propose flags corresponding to nodes in W are
true, i.e., all correspond to a flag holding 1. As the node reset
its propose flags at the most recent time when it switched to
ready and no nodes from W have been observed in propose
between this time and $t'_i$, it holds that $f + 1$ nodes in W
switched to state propose during $[t'_i, t')$. Since we established
that no node resets its propose flags during $[t'_i, t' + 2d]$, it
follows that all nodes are in state propose by time $t' + d$.
Consequently, all nodes in W will observe all nodes in W in
state propose before time $t' + 2d$ and switch to accept, i.e.,
$t' \in (t + (T_2 + T_3)/\vartheta, t + T_2 + T_4 + 4d)$ is a W-stabilization
point. Statement (ii) follows.

Footnote: Note that since we take the infimum on
$(t + (T_2 + T_3)/\vartheta, t + T_2 + T_4 + 4d]$, we have that $t' \le t + T_2 + T_4 + 4d$.

On the other hand, if at time $t'$ no node in W switches
to state accept, it follows that $t' = t + T_2 + T_4 + 4d$. As all
nodes observe themselves in state ready by time $t + T_2 + 5d$,
they switch to propose before time $t + T_2 + T_4 + 5d = t' + d$
because $T_4$ expired. By the same reasoning as in the previous
case, they switch to accept before time $t' + 2d$, i.e., Statement
(ii) holds as well.

Proof of (iii): We have shown that within $[t_j, t' + 3d]$, any
node $j \in W$ switches to states along the basic cycle only.
Note that Condition (ii) in the definition of
metastability-freedom is satisfied by definition for state transitions along
the basic cycle, as the conditions involve memory flags and
timeouts (that are not associated with the states the nodes
switch to) only. To show the correctness of Statement (iii),
it is thus sufficient to prove that, whenever $j$ switches
from state $s$ of the basic cycle to $s'$ of the basic cycle
during time $[t_j, t' + 3d]$, the transition from $s$ to recover is
disabled from the time it switches to $s'$ until it observes itself
in this state. We consider transitions tr(accept, recover),
tr(waking, recover), and tr(propose, recover) one after the
other:

1) tr(accept, recover): We showed that node $j$'s condition
tr(accept, sleep) is satisfied before time $t + 4d \le t + T_1/\vartheta$,
i.e., before tr(accept, recover) can hold, and
no node resets its accept flags less than d time after
switching to state sleep. When $j$ switches to state
accept again at or after time $t'$, $T_1$ will not expire
earlier than time $t' + 4d$.

2) tr(waking, recover): As part of the reasoning about
Statement (ii), we derived that tr(waking, recover)
does not hold at nodes from W observing themselves
in state waking.

3) tr(propose, recover): The additional slack of d in
Inequality (17) ensures that $T_5$ does not expire at any
node in W switching to state accept during $(t', t' + 2d)$
earlier than time $t' + 3d$.

Since $[t_j, t' + 3d] \supseteq [t + 3d, t' + 3d]$, Statement (iii) follows.



Inductive application of Theorem 4.4 shows that, by
construction of our algorithm, nodes in W provably do not
suffer from metastability upsets once a W-quasi-stabilization
point is reached, as long as all nodes in W remain non-faulty
and the channels connecting them correct. Unfortunately, it
can be shown that it is impossible to ensure this property
during the stabilization period, thus rendering a formal
treatment infeasible. This is not a peculiarity of our system
model, but a threat to any model that allows for the
possibility of metastable upsets as encountered in physical chip
designs. However, it was shown that, by proper chip design,
the probability of metastable upsets can be made arbitrarily
small [29]. In the remainder of this work, we will therefore
assume that all non-faulty nodes are metastability-free in all
executions.

The next lemma reveals a very basic property of the main 
algorithm that is satisfied if no nodes may switch to state 
join in a given period of time. It states that if a non-faulty 
node switches to state sleep, other non-faulty nodes cannot 
remain too far ahead or behind in their execution of the basic 
cycle. 

Lemma 4.5: Assume that at time $t_{\mathrm{sleep}}$, some node from
W switches to sleep and no node from W is in state join
during $[t_{\mathrm{sleep}} - T_1 - d, t_{\mathrm{sleep}} + 2T_1 + 3d]$. Then

(i) at time $t_{\mathrm{sleep}} + 2T_1 + 3d$, any node is in one of the states
sleep, sleep → waking, waking, or recover,

(ii) any node in states sleep, sleep → waking, or waking
reset its timeout $T_2$ at some time from
$(t_{\mathrm{sleep}} - \Delta_g - 4d, t_{\mathrm{sleep}} + (2 - 1/\vartheta)T_1 + 3d)$; and

(iii) no node switches from recover to accept during
$[t_{\mathrm{sleep}} + T_1 + 2d, t_a]$, where $t_a > t_{\mathrm{sleep}} + 2T_1 + 3d$ denotes the
infimum of times larger than $t_{\mathrm{sleep}} + T_1 + 2d$ when a
node switches to state accept.

Proof: We claim that there is a subset $A \subseteq W$ of at
least $n - 2f$ nodes such that each node from $A$ has been in
state accept at some time in the interval $(t_{\mathrm{sleep}} - T_1 - d, t_{\mathrm{sleep}})$.
To see this, observe that if a node switches to state sleep at
time $t_{\mathrm{sleep}}$, it must have observed $n - 2f$ non-faulty nodes in
state accept at times from $(t_{\mathrm{sleep}} - T_1, t_{\mathrm{sleep}}]$, since it resets
its accept flags at the time $t_a > t_{\mathrm{sleep}} - T_1$ (that is minimal
with this property) when it switched to state accept. Each
of these nodes must have been in state accept at some time
from $(t_{\mathrm{sleep}} - T_1 - d, t_{\mathrm{sleep}})$, showing the existence of a set
$A \subseteq W$ as claimed.
During 



sleep 



^sieep 



^ [^sleep 



Ti + 2d, 



Ti + 2d, t 



d{2Ti + 3d) T2 



sleep 



-2d] 



no node from $A$ is observed in state accept, as following the
basic cycle requires $T_2$ to expire, no node switches to join,
and in order to switch directly from recover to accept, a
timeout of $\vartheta(2T_1 + 3d)$ needs to expire first. Since this also
applies to the nodes from $A$ and no node is in state join until

"Note that it is feasible to incorporate this issue into the model by 
means of the probability space, as it is beyond control of "reasonable" 
adversaries to control signals on faulty channels (or ones that originate at 
non-faulty nodes) precisely enough to ensure more than a small probability 
of a metastable upsets. However, since it is (at best) impractical to consider 
metastable states of the system in our theoretical framework, essentially 
this approach boils down to counting the number of state transitions during 
stabilization where a non-faulty node is in danger of becoming metastable 
and control this risk by means of the union bound. 



time $t_{\mathrm{sleep}} + 2T_1 + 3d$, the only way to do so is by following
the basic cycle via states sleep, sleep → waking, waking,
ready, and propose. However, this takes at least until time



?9 



^ sleep 



2Ti + 3d 



as well. This shows Statement (iii) of the lemma. 

Now consider any node that observes itself in one of the
states waking, ready, or propose at time $t_{\mathrm{sleep}} - T_1 - d$. By
time $t_{\mathrm{sleep}} + d$, it will memorize all nodes from $A$ in accept
(provided that it did not switch to accept in the meantime).
Hence, it satisfies tr(waking, recover), tr(ready, propose),
and tr(propose, accept) until it switches to either recover or
accept. It follows that any such node must have switched to 
recover or accept by time tg -Vid < tg + Ti + 2d. On the 
other hand, nodes that do not observe themselves in state
waking at time $t_{\mathrm{sleep}} - T_1 - d$ but are in one of the states
sleep, sleep → waking, or waking at this time or switch to
sleep during time $[t_{\mathrm{sleep}} - T_1 - d, t_{\mathrm{sleep}} + 2T_1 + 3d]$ must
have reset their timeout $T_2$ at some time from

\[ \left( t_{\mathrm{sleep}} - (2\vartheta + 3)T_1 - 4d,\; t_{\mathrm{sleep}} + \left(2 - \frac{1}{\vartheta}\right) T_1 + 3d \right), \]

i.e., Statement (ii) holds. To infer Statement (i), it remains
to show that none of the latter nodes may switch to ready
until time $t_{\mathrm{sleep}} + 2T_1 + 3d$. As no nodes from W are in state
join during $[t_{\mathrm{sleep}} - T_1 - d, t_{\mathrm{sleep}} + 2T_1 + 3d]$ by assumption,
the statement follows immediately from Statement (ii), as

\[ t_{\mathrm{sleep}} - (2\vartheta + 3)T_1 + \frac{T_2}{\vartheta} - 4d \ge t_{\mathrm{sleep}} + 2T_1 + 3d. \]

The lemma follows. ■
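The closing inequality of this proof, $-(2\vartheta + 3)T_1 + T_2/\vartheta - 4d \ge 2T_1 + 3d$ after cancelling $t_{\mathrm{sleep}}$, can be rearranged to $T_2 \ge \vartheta((2\vartheta + 5)T_1 + 7d)$. The sketch below checks this rearrangement with exact rational arithmetic. The function names and the sample values are hypothetical, and the threshold is only what this single inequality requires, not the full set of constraints from Condition 3.3.

```python
from fractions import Fraction


def closing_inequality_holds(theta, T1, T2, d):
    """Check -(2*theta + 3)*T1 + T2/theta - 4*d >= 2*T1 + 3*d (t_sleep cancels)."""
    return -(2 * theta + 3) * T1 + T2 / theta - 4 * d >= 2 * T1 + 3 * d


def min_T2(theta, T1, d):
    """Smallest T2 satisfying the inequality: theta * ((2*theta + 5)*T1 + 7*d)."""
    return theta * ((2 * theta + 5) * T1 + 7 * d)


if __name__ == "__main__":
    # Hypothetical sample values; Fractions keep the boundary case exact.
    theta, T1, d = Fraction(21, 20), Fraction(10), Fraction(1)
    threshold = min_T2(theta, T1, d)
    print(threshold, closing_inequality_holds(theta, T1, threshold, d))
```

Using `Fraction` rather than floats avoids rounding effects exactly at the threshold, where both sides of the inequality are equal.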

Granted that nodes are not in state join for sufficiently
long, this implies that nodes will switch to sleep in rough
synchrony with others or drop out of the basic cycle and
switch to recover.

Corollary 4.6: Assume that at time $t_{\mathrm{sleep}}$, a node from W
switches to sleep, no node is in state join during
$[t_{\mathrm{sleep}} - T_1 - d, t_{\mathrm{sleep}} + 2T_1 + 4d]$, and also that during
$[t_{\mathrm{sleep}} - \Delta_g, t_{\mathrm{sleep}}) = [t_{\mathrm{sleep}} - (2\vartheta + 3)T_1, t_{\mathrm{sleep}})$ no node in W is in state sleep.
Then at time $t_{\mathrm{sleep}} + 2T_1 + 4d$, any node from W is either
in one of the states sleep or sleep → waking and observed
in sleep, or it is and is observed in state recover.

Proof: We apply Lemma 4.5 to see that at time
$t_{\mathrm{sleep}} + 2T_1 + 3d$, all nodes are in one of the states sleep, sleep →
waking, waking, or recover. As nodes remain in sleep for a
timeout of duration $(2\vartheta + 1)T_1 \ge \vartheta(2T_1 + 4d)$, the statement
of the corollary follows immediately provided that we can
show that any node that does not switch to state sleep during
$[t_{\mathrm{sleep}}, t_{\mathrm{sleep}} + T_1 + 3d]$ is not in state waking at time
$t_{\mathrm{sleep}} + T_1 + 3d$. Consider such a node. If there is a time from
$[t_{\mathrm{sleep}} - \Delta_g, t_{\mathrm{sleep}} + T_1 + 3d]$ when the node is not in one of the
states sleep, sleep → waking, or waking, it cannot be in
state waking at time $t_{\mathrm{sleep}} + 2T_1 + 5d$, since it could not have
switched to sleep again in order to get there. Assume w.l.o.g.
that the node switches to sleep exactly at time $t_{\mathrm{sleep}} - \Delta_g$.
Thus, it must have previously reset its timeout $T_2$ no later
than

tsieep _ ^sleep 



- Ag - 4d. 



Hence we conclude by Lemma 4.5 that the node is in state
recover at time $t_{\mathrm{sleep}} + 2T_1 + 5d$, finishing the proof. ■

B. Resynchronization Points 

In this section, we derive that within linear time, it is
very likely that good resynchronization points occur. As a
first step, we infer from Lemma 4.5 that whenever nodes
may not enter state join, the time windows during which
nodes may switch to sleep occur infrequently.

Lemma 4.7: Suppose no node is in state join during
$[t^-, t^+]$. Then the volume of times $t \in [t^- + T_1 + d, t^+]$
satisfying that no node is in state sleep during $[t - \Delta_g, t)$ is
at least

\[ \frac{T_2 - 2\vartheta\Delta_g - (\vartheta - 1)T_1 - 4\vartheta d}{T_2 - (\vartheta - 1)T_1 - \vartheta d}\,(t^+ - t^-) - (4\Delta_g + T_1 + 7d). \]



Proof: Denote by $t_0$ the infimum of times from
$[t^- + T_1 + d, t^+]$ when a node switches to sleep. Thus, by definition
any time $t \in [t^- + \Delta_g + T_1 + d, t_0]$ satisfies that no node is
in state sleep during $[t - \Delta_g, t)$. We proceed by induction
over increasing times $t_i \in [t_0, t^+]$, $i \in \{1, \ldots, i_{\max}\}$. The
induction halts at index $i_{\max} \in \mathbb{N}$ iff
$t_{i_{\max}} > t^+ - T_2/\vartheta + (1 - 1/\vartheta)T_1 + d$. We claim that, for each $i$, the volume of
times $t \in [t^- + T_1 + d, t_i]$ such that no node is in state sleep
during $[t - \Delta_g, t)$ is at least

\[ t_i - t^- - (T_1 + d) - i(2\Delta_g + 3d) \tag{18} \]

and that

\[ t_i \ge t^- - \left(2\vartheta + 1 + \frac{1}{\vartheta}\right) T_1 - 2d + i\left(\frac{T_2}{\vartheta} - \left(1 - \frac{1}{\vartheta}\right) T_1 - d\right). \tag{19} \]



In fact, we will show these bounds by establishing that no 
node is in state sleep during 



(t,_i + (2i? + 3)Ti + 3d, ti) = [t,_i 
and that 



ti > 



T2 



Ad 



Ag + 3d,i,) (20) 



2Ag + 3d (21) 



for all i e {1, 



5 



We first establish these bounds for $t_1$. By Lemma 4.5,
every node not switching to state recover until time
$t_0 + T_1 + 3d$ resets $T_2$ at some time from $[t_0 - \Delta_g - 4d, t_0 + 3d)$
and is in one of the states sleep, sleep → waking, or waking
at time $t_0 + T_1 + 3d$. Hence, such nodes do not switch to
state ready and subsequently to propose, accept, and sleep
again until $t_0 + T_2/\vartheta - \Delta_g - 4d < t^+$, giving

\[ t_1 \ge t_0 + \frac{T_2}{\vartheta} - \Delta_g - 4d. \]

Moreover, the lemma implies that no node is in state sleep
during $[t_0 + (2\vartheta + 3)T_1 + 3d, t_1)$, as any node in state sleep
at time $t_0 + 2T_1 + 3d$ will leave after a timeout of $(2\vartheta + 1)T_1$
expires. Hence, the volume of times $t \in [t^- + T_1 + d, t_1]$
such that no node is in state sleep during $[t - \Delta_g, t)$ is at
least

\[ t_0 - (t^- + T_1 + d + \Delta_g) + t_1 - (t_0 + \Delta_g + 3d), \]

showing the claim for $i = 1$.

We now perform the induction step from $i < i_{\max}$ to $i + 1$.
By (20), no node is in state sleep during

\[ [t_{i-1} + \Delta_g + 3d, t_i) \supseteq [t_i - \Delta_g, t_i). \]

Hence we can apply Corollary 4.6 to see that nodes not
observing themselves in state sleep at time $t_i + 2T_1 + 4d$
switched to state recover. Therefore, nodes that continue
to execute the basic cycle must have performed their most
recent reset of timeout $T_2$ at or after time $t_i - T_1 - d$. Thus,
such nodes do not switch to state ready and subsequently
to propose, accept, and sleep again until
$t_i + T_2/\vartheta - (1 - 1/\vartheta)T_1 - d < t^+$, giving

\[ t_{i+1} \ge t_i + \frac{T_2}{\vartheta} - \left(1 - \frac{1}{\vartheta}\right) T_1 - d. \]

Moreover, no node is in state sleep during
$[t_i + (2\vartheta + 3)T_1 + 3d, t_{i+1})$. These two statements show Inequality (20) and
Inequality (21) for $i + 1$, and by means of the induction
hypothesis directly imply Inequality (18) and Inequality (19)
for $i + 1$ as well, i.e., the induction succeeds.



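From Inequality (19) we have that

\[ i_{\max} \le \frac{t^+ - t^- + (2\vartheta + 1 + 1/\vartheta)T_1 + 2d}{T_2/\vartheta - (1 - 1/\vartheta)T_1 - d} \le \frac{t^+ - t^-}{T_2/\vartheta - (1 - 1/\vartheta)T_1 - d} + 1. \tag{22} \]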



Observe that the same reasoning as above shows that no
node switches to sleep during $[t_{i_{\max}} + \Delta_g + 3d, t^+]$ since
$t_{i_{\max}} > t^+ - (T_2/\vartheta - (1 - 1/\vartheta)T_1 - d)$. Thus, inserting
$i = i_{\max}$ into Inequality (18), we infer that the volume of
times $t \in [t^- + T_1 + d, t^+]$ such that no node is in state
sleep during $[t - \Delta_g, t)$ is at least

\[ t^+ - t^- - (T_1 + d) - (i_{\max} + 1)(2\Delta_g + 3d) \ge \frac{T_2 - 2\vartheta\Delta_g - (\vartheta - 1)T_1 - 4\vartheta d}{T_2 - (\vartheta - 1)T_1 - \vartheta d}\,(t^+ - t^-) - (4\Delta_g + T_1 + 7d), \]

concluding the proof. ■
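To get a feeling for the bound of Lemma 4.7, the following sketch evaluates it for given parameters. It assumes $\Delta_g = (2\vartheta + 3)T_1$, as suggested by the equality used in the proof; the function names and the sample values are hypothetical and serve only as an illustration.

```python
from fractions import Fraction


def good_fraction(theta, d, T1, T2):
    """Constant factor in front of (t+ - t-) in the bound of Lemma 4.7.

    Assumes Delta_g = (2*theta + 3) * T1.
    """
    delta_g = (2 * theta + 3) * T1
    numerator = T2 - 2 * theta * delta_g - (theta - 1) * T1 - 4 * theta * d
    denominator = T2 - (theta - 1) * T1 - theta * d
    return numerator / denominator


def sleep_free_volume_lower_bound(t_minus, t_plus, theta, d, T1, T2):
    """Lower bound on the volume of times t with no node in sleep during [t - Delta_g, t)."""
    delta_g = (2 * theta + 3) * T1
    constant_loss = 4 * delta_g + T1 + 7 * d
    return good_fraction(theta, d, T1, T2) * (t_plus - t_minus) - constant_loss


if __name__ == "__main__":
    # Hypothetical sample values, exact via Fraction.
    args = dict(theta=Fraction(1), d=Fraction(1), T1=Fraction(4), T2=Fraction(100))
    print(good_fraction(**args))
    print(sleep_free_volume_lower_bound(Fraction(0), Fraction(990), **args))
```

The bound is only informative for intervals long enough that the linear term outweighs the constant loss $4\Delta_g + T_1 + 7d$.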



We are now ready to advance to proving that good
resynchronization points are likely to occur within bounded
time, no matter what the strategy of the adversary is. To this
end, we first establish that in any execution, at most
times a node switching to state init will result in a good
resynchronization point. This is formalized by the following
definition.

Definition 4.8 (Good Times): Given an execution $\mathcal{E}$ of the
system, denote by $\mathcal{E}'$ any execution satisfying that
$\mathcal{E}'|_{[0,t)} = \mathcal{E}|_{[0,t)}$, where at time $t$ a node $i \in W$ switches to state init in
$\mathcal{E}'$. Time $t$ is good in $\mathcal{E}$ with respect to W provided that for
any such $\mathcal{E}'$ it holds that $t$ is a good W-resynchronization
point in $\mathcal{E}'$.

The previous statement thus boils down to showing that in 
any execution, the majority of the times is good. 

Lemma 4.9: Given any execution $\mathcal{E}$ and any time interval
$[t^-, t^+]$, the volume of good times in $\mathcal{E}$ during $[t^-, t^+]$ is at
least

\[ \lambda^2 (t^+ - t^-) - \frac{11(1 - \lambda)R_2}{10\vartheta}. \]



Proof: Assume w.l.o.g. that $|W| = n - f$ (otherwise
consider a subset of size $n - f$) and abbreviate



> 



N 

( ^it+-t-) 11 

V R2 10 
}}{t+ -t-) + R2/10 



(n-f) 



(n-f) 



5 



'd{t+ -t-)+ + AAg + Ti + 10d)/(5(l - A)) 
(n-f) 

d{t+ -t- +Ri+ AAg + Ti + lOd) 



i?2 



(n-f). (23) 



The proof is in two steps: First we construct a measurable
subset of $[t^-, t^+]$ that comprises good times only. In a
second step a lower bound on the volume of this set is
derived.

Constructing the set: Consider an arbitrary time
$t \in [t^-, t^+]$, and assume a node $i \in W$ switches to state init
at time $t$. When it does so, its timeout $R_3$ expires. By
Lemma 4.3, all timeouts of node $i$ that expire at times within
$[t^-, t^+]$ have been reset at least once until time $t^-$. Let $t_{R_3}$
be the maximum time not later than $t$ when $R_3$ was reset.
Due to the distribution of $R_3$ we know that

\[ t_{R_3} \le t - (R_2 + 3d). \]

Thus, node $i$ is not in state init during time $[t - (R_2 + 2d), t)$,
and no node $j \in W$ observes $i$ in state init during time
$[t - (R_2 + d), t)$. Thereby any node $j$'s, $j \in W$, timeout
$(R_2, \text{supp } i)$ corresponding to node $i$ is expired at time $t$.

We claim that the condition that no node from W is in or
observed in one of the states resync or supp → resync at time
$t$ is sufficient for $t$ being a W-resynchronization point. To
see this, assume that the condition is satisfied. Thus all nodes
$j \in W$ are in states none or supp $k$ for some $k \in \{1, \ldots, n\}$
at time $t$. By the algorithm, they all will switch to state supp $i$
or state supp → resync during $(t, t + d)$. It might happen
that they subsequently switch to another state supp $k'$ for
some $k' \in V$, but all of them will be in one of the states
with signal supp during $(t + d, t + 2d]$. Consequently, all
nodes will observe at least $n - f$ nodes in state supp during
$(t', t + 2d)$ for some time $t' < t + 2d$. Hence, those nodes in
W that were still in state supp $i$ (or supp $k'$ for some $k'$) at
time $t + d$ switch to state supp → resync before time $t + 2d$,
i.e., $t$ is a W-resynchronization point.

We proceed by analyzing under which conditions $t$ is a
good W-resynchronization point. Recall that in order for $t$
to be good, it has to hold that no node from W switches
to state sleep during $(t - \Delta_g, t)$ or is in state join during
$(t - T_1 - d, t + 4d)$.

We begin by characterizing subsets of good times within
$(t_r, t'_r) \subseteq [t^-, t^+]$, where $t_r$ and $t'_r$ are times such that during
$(t_r, t'_r)$ no node from W switches to state supp → resync.
Due to timeout

Ri i {M + 2)d, 

we know that during $(t_r + R_1 + 2d, t'_r)$, no node from W
will be in, or be observed in, states supp → resync or resync.
Thus, if a node from W switches to init at a time within
$(t_r + R_1 + 2d, t'_r)$, it is a W-resynchronization point. Further,
all nodes in W will be in state dormant during
$(t_r + R_1 + 2d, t'_r + 4d)$. Thus all nodes in W will be observed to be in
state dormant during $(t_r + R_1 + 3d, t'_r + 4d)$, implying that
they are not in state join during $(t_r + R_1 + 3d, t'_r + 4d)$. In
particular, any time $t \in (t_r + R_1 + T_1 + 4d, t'_r)$ satisfies that
no node in W is in state join during $(t - T_1 - d, t + 4d)$.
Applying Lemma 4.7, we infer that the total volume of
times from $(t_r, t'_r)$ that is good is at least

\[ \frac{T_2 - 2\vartheta\Delta_g - (\vartheta - 1)T_1 - 4\vartheta d}{T_2 - (\vartheta - 1)T_1 - \vartheta d}\,(t'_r - t_r) - (R_1 + 4\Delta_g + T_1 + 10d). \tag{24} \]

In other words, up to a constant loss in each interval
$(t_r, t'_r)$, a constant fraction of the times are good.

Volume of the set: In order to infer a lower bound on the
volume of good times during $[t^-, t^+]$, we subtract from
$t^+ - t^-$ the volume of some intervals during which we cannot
exclude that a node switches to supp → resync, increased by
the constant term $R_1 + 4\Delta_g + T_1 + 10d$ from Inequality (24).
The inequality then yields that at least a fraction of
$(T_2 - 2\vartheta\Delta_g - (\vartheta - 1)T_1 - 4\vartheta d)/(T_2 - (\vartheta - 1)T_1 - \vartheta d)$ of the
remaining volume of times is good. Note that we also need
to account for the fact that nodes may already be in state
supp → resync at time $t^-$, which we account for by also
covering events prior to $t^-$ when nodes switch to supp →
resync. Formally, we define

\[ G = \bigcup_{\substack{t_r \in [t^- - (R_1 + 4\Delta_g + T_1 + 10d),\, t^+] \\ \exists i \in W:\ i \text{ switches to supp} \to \text{resync at } t_r}} [t_r, t_r + R_1 + 4\Delta_g + T_1 + 10d] \]

and strive for a lower bound on the volume of $[t^-, t^+] \setminus G$.
In order to lower bound the good times in $[t^-, t^+]$, it is
thus sufficient to cover all times when a node switches to
supp → resync during $[t^- - (R_1 + 4\Delta_g + T_1 + 10d), t^+]$ by
$2N - 1$ intervals of volume at most $V$ and then infer a lower
bound of $t^+ - t^- - 2N(V + R_1 + 4\Delta_g + T_1 + 10d)$ on the
volume of $[t^-, t^+] \setminus G$. The remainder of the proof hence
is concerned with deriving such a cover of the times when
nodes may switch to supp → resync during
$[t^- - (R_1 + 4\Delta_g + T_1 + 10d), t^+]$.

Observe that any node in W does not switch to state init 
more than 



t+ -t- +Ri+ AAg +Ti + lOd' 
R3 

t+ -t- +Ri+ AAg + Ti + lOd' 
R2 + d 

N 
n - f 



(25) 



times during $[t^- - (R_1 + 4\Delta_g + T_1 + 10d), t^+]$.

Now consider the case that a node in W switches to
state supp → resync at a time $t$ satisfying that no node
in W switched to state init during $(t - (8\vartheta + 6)d, t)$. This
necessitates that this node observes $n - f$ of its channels in
state supp during $(t - (2\vartheta + 1)d, t)$, at least $n - 2f \ge f + 1$
of which originate from nodes in W. As no node from W
switched to init during $(t - (8\vartheta + 6)d, t)$, every node that
has not observed a node $i \in V \setminus W$ in state init at a time
from $(t - (8\vartheta + 4)d, t)$ when $(R_2, \text{supp } i)$ is expired must be
in a state whose signal is none during $(t - (2\vartheta + 2)d, t)$ due
to timeouts. Therefore its outgoing channels are not in state
supp during $(t - (2\vartheta + 1)d, t)$. By means of contradiction, it
thus follows that for each node $j$ of the at least $f + 1$ nodes
(which are all from W), there exists a node $i \in V \setminus W$
such that node $j$ resets timeout $(R_2, \text{supp } i)$ during the time
interval $(t - (8\vartheta + 4)d, t)$.

The same reasoning applies to any time
$t' \notin (t - (8\vartheta + 6)d, t)$ satisfying that some node in W switches to state
supp → resync at time $t'$ and no node in W switched to
state init during $(t' - (8\vartheta + 6)d, t')$. Note that the set of the
respective at least $f + 1$ events (corresponding to the at least
$f + 1$ nodes from W) where timeouts $(R_2, \text{supp } i)$ with
$i \in V \setminus W$ are reset and the set of the events corresponding to $t$
are disjoint. However, the total number of events where such
a timeout can be reset during $[t^- - (R_1 + 4\Delta_g + T_1 + 10d), t^+]$



is upper bounded by 

\v\w\\w\ 



t+ -t- +Ri+ 4A„ + Ti + IM' 



R2M 



(26) 



i.e., the total number of channels from nodes not in W
($|V \setminus W|\,|W|$ many) to nodes in W multiplied by the number of times
an associated timeout can expire at a receiving node in W
during $[t^- - (R_1 + 4\Delta_g + T_1 + 10d), t^+]$.

With the help of inequalities (25) and (26), we can show
that $G$ can be covered by less than $2N$ intervals of size
$(R_1 + 4\Delta_g + T_1 + 10d) + (8\vartheta + 6)d$ each. By Inequality (25),
there are no more than $N$ times in
$[t^- - (R_1 + 4\Delta_g + T_1 + 10d), t^+]$ when one of the $|W| = n - f$ many non-faulty
nodes switches to init and thus may cause others to switch to
state supp → resync at times in $[t, t + (8\vartheta + 6)d]$. Similarly,
Inequality (26) shows that the channels from $V \setminus W$ to W
may cause at most $N - 1$ such times
$t \in [t^- - (R_1 + 4\Delta_g + T_1 + 10d), t^+]$, since any such time requires the existence
of at least $f + 1$ events where timeouts $(R_2, \text{supp } i)$,
$i \in V \setminus W$, are reset at nodes in W, and the respective events are
disjoint. Thus, all times $t_r \in [t^- - (R_1 + 4\Delta_g + T_1 + 10d), t^+]$
when some node $i \in W$ switches to supp → resync are
covered by at most $2N - 1$ intervals of length $(8\vartheta + 6)d$.

This results in a cover $G' \supseteq G$ consisting of at most
$2N - 1$ intervals that satisfies

\[ \mathrm{vol}(G) \le \mathrm{vol}(G') \le 2N(R_1 + 4\Delta_g + T_1 + (8\vartheta + 16)d). \]



As argued previously, summing over the at most $2N$
intervals that remain in $[t^-, t^+] \setminus G'$ and using Inequality (24),
it follows that the volume of good times during $[t^-, t^+]$ is
at least

T2 - 2i?Ag - (?9 - l)Ti - AM 



- (i? - l)Ti ~ dd 
{t+ -r - 2N{Ri + 4A<, - 



A 



2iV(i?i + 4Ag 



t+ -r -2 



(i?i + 4A,, - 
2'd{Ri 



1 - 



-4A, 



R^2 

+ (8^9 + 



11 
10 



(8?9 + 16)d) 
- (8i9 + 16)(i)) 



16)d)) 

|-(8i? + 16)d)(n 



/) 



i?2 



■xit+ -t-) 

llA(i?i + 4A3 + Ti + {M + 16}d){n - f) 



11(1 - A)fl2 



as claimed. The lemma follows. ■

We are now in the position to prove our second main
theorem, which states that a good resynchronization point
occurs within $O(R_2)$ time with overwhelming probability.



Theorem 4.10: Denote by $E_3 := \vartheta(R_2 + 3d) + 8(1 - \lambda)R_2 + d$
the maximal value the distribution $R_3$ can attain
plus the at most d time until $R_3$ is reset whenever it expires.
For any $k \in \mathbb{N}$ and any time $t$, with probability at least
$1 - (1/2)^{k(n-f)}$ there will be a good W-resynchronization
point during $[t, t + (k + 1)E_3]$.
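The quantities in Theorem 4.10 are easy to tabulate. The sketch below computes $E_3$, the probability bound $1 - (1/2)^{k(n-f)}$, and the smallest $k$ for which the failure term drops below a target value. The function names and the sample values are hypothetical and are not taken from the prototype.

```python
def e3(theta: float, d: float, R2: float, lam: float) -> float:
    """E_3 = theta*(R2 + 3d) + 8*(1 - lam)*R2 + d, as defined in Theorem 4.10."""
    return theta * (R2 + 3 * d) + 8 * (1 - lam) * R2 + d


def success_probability(k: int, n: int, f: int) -> float:
    """Lower bound 1 - (1/2)^(k*(n - f)) on seeing a good resynchronization point."""
    return 1.0 - 0.5 ** (k * (n - f))


def rounds_needed(eps: float, n: int, f: int) -> int:
    """Smallest k >= 1 such that (1/2)^(k*(n - f)) <= eps."""
    if not 0.0 < eps < 1.0 or n - f < 1:
        raise ValueError("need 0 < eps < 1 and n - f >= 1")
    k = 1
    while 0.5 ** (k * (n - f)) > eps:
        k += 1
    return k


if __name__ == "__main__":
    # Hypothetical system with n = 4 nodes and f = 1 fault.
    k = rounds_needed(1e-9, n=4, f=1)
    window = (k + 1) * e3(theta=1.05, d=1.0, R2=1000.0, lam=0.9)
    print(k, success_probability(k, 4, 1), window)
```

Because the exponent grows with both $k$ and $n - f$, larger systems need fewer multiples of $E_3$ to reach the same confidence.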

Proof: Assume w.l.o.g. that $|W| = n - f$ (otherwise
consider a subset of size $n - f$). Fix some node $i \in W$ and
denote by $t_0$ the infimum of times from $[t, t + (k + 1)E_3]$
when node $i$ switches to init. We have that $t_0 \le t + E_3$. By
induction, it follows that node $i$ will switch to state init at
least another $k$ times during $[t, t + (k + 1)E_3]$ at the times
$t_1 < t_2 < \ldots < t_k$. We claim that each such time $t_j$,
$j \in \{1, \ldots, k\}$, has an independently by 1/2 lower bounded
probability of being good and therefore being a good
W-resynchronization point.

We prove this by induction on $j$: As induction hypothesis,
suppose for some $j \in \{1, \ldots, k - 1\}$, we showed the
statement for $l \in \{1, \ldots, j - 1\}$ and the execution of
the system is fixed until time $t_{j-1}$, i.e., $\mathcal{E}|_{[0, t_{j-1}]}$ is given.
Now consider the set of executions that are extensions of
$\mathcal{E}|_{[0, t_{j-1}]}$ and have the same clock functions as $\mathcal{E}$. For each
such execution $\mathcal{E}'$ it holds that $\mathcal{E}'|_{[0, t_{j-1}]} = \mathcal{E}|_{[0, t_{j-1}]}$ and
all nodes' clocks make progress in $\mathcal{E}'$ as in $\mathcal{E}$. Clearly each
such $\mathcal{E}'$ has its own time $t_j \le t + (j + 1)E_3$ when $R_3$
expires next after $t_{j-1}$ at node $i$, and $i$ switches to init. We
next characterize the distribution of the times $t_j$.

As the rate of the clock driving node $i$'s $R_3$ is between
1 and $\vartheta$, $t_j > t_{j-1}$ is within an interval, call it $[t^-, t^+]$, of
size at most

\[ t^+ - t^- \le 8(1 - \lambda)R_2, \]

regardless of the progress that $i$'s clock C makes in any
execution $\mathcal{E}'$.

Certainly we can apply Lemma 4.9 also to each of the
executions $\mathcal{E}'$, showing that the volume of times from $[t^-, t^+]$ that are not
good in $\mathcal{E}'$ is at most

\[ (1 - \lambda^2)(t^+ - t^-) + \frac{11(1 - \lambda)R_2}{10\vartheta}. \]
10i9 



Since clock C can make progress not faster than at rate
$\vartheta$ and the probability density of $R_3$ is constantly
$1/(8(1 - \lambda)R_2)$ (with respect to the clock function C), we obtain
that the probability of $t_j$ not being a good time is upper
bounded by

\[ \frac{(1 - \lambda^2)(t^+ - t^-) + 11(1 - \lambda)R_2/(10\vartheta)}{8(1 - \lambda)R_2/\vartheta} \le \vartheta(1 - \lambda^2) + \frac{11}{80} < \vartheta \cdot \frac{9}{25\vartheta} + \frac{7}{50} = \frac{1}{2}. \]

Here we use that the time when $R_3$ expires is independent
of $\mathcal{E}'|_{[0, t_{j-1}]}$.
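The arithmetic behind the 1/2 bound can be checked exactly. The sketch below evaluates $\vartheta(1 - \lambda^2) + 11/80$ under the assumption that $\lambda^2 = (25\vartheta - 9)/(25\vartheta)$; this choice of $\lambda$ is not stated in this section and is assumed here for illustration, and the function names are hypothetical.

```python
from fractions import Fraction


def assumed_lambda_squared(theta: Fraction) -> Fraction:
    """Assumed relation lambda^2 = (25*theta - 9) / (25*theta); not taken from this section."""
    return (25 * theta - 9) / (25 * theta)


def failure_bound(theta: Fraction, lam_sq: Fraction) -> Fraction:
    """Upper bound theta*(1 - lambda^2) + 11/80 on the probability that t_j is not good."""
    return theta * (1 - lam_sq) + Fraction(11, 80)


if __name__ == "__main__":
    for theta in (Fraction(1), Fraction(21, 20), Fraction(3, 2)):
        bound = failure_bound(theta, assumed_lambda_squared(theta))
        print(theta, bound, bound < Fraction(1, 2))
```

Under the assumed relation the drift bound cancels, so the result does not depend on $\vartheta$ and stays strictly below 1/2.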



We complete our reasoning as follows. Given $\mathcal{E}|_{[0, t_{j-1}]}$,
we permit an adversary to choose $\mathcal{E}'$, including random
bits of all nodes and full knowledge of the future, with the
exception that we deny it control or knowledge of the time $t_j$
when $R_3$ expires at node $i$, i.e., $\mathcal{E}'$ is an imaginary execution
in which $R_3$ does not expire at $i$ at any time greater
than $t_{j-1}$. Note that for the good W-resynchronization
points we considered, the choice of $\mathcal{E}'$ does not affect the
probability that $t_1, \ldots, t_{j-1}$ are good W-resynchronization
points: The conditions referring to times greater than a
W-resynchronization point $t$, i.e., that all nodes in W switch to
state supp → resync during $(t, t + 2d)$ and no node in W
shall be in state join during $(t - T_1 - d, t + 4d)$, are already
fully determined by the history of the system until time $t$.
As we fixed $\mathcal{E}'$, the behavior of the clock driving $R_3$ is fixed
as well. Next, we determine the time $t_j$ when $R_3$ expires
according to its distribution, given the behavior of node $i$'s
clock. The above reasoning shows that time $t_j$ is good in $\mathcal{E}'$
with probability at least 1/2, independently of
$\mathcal{E}'|_{[0, t_{j-1}]} = \mathcal{E}|_{[0, t_{j-1}]}$. We define that $\mathcal{E}|_{[0, t_j)} = \mathcal{E}'|_{[0, t_j)}$ and in $\mathcal{E}$ node $i$
switches to state init (because $R_3$ expired). As, conditional
to the clock driving $R_3$ and $t_{j-1}$ being specified, $t_j$ is
independent of $\mathcal{E}|_{[0, t_j)}$, $\mathcal{E}$ is indistinguishable from $\mathcal{E}'$ until
time $t_j$. Because $t_j$ is good with probability at least 1/2
independently of $\mathcal{E}'|_{[0, t_{j-1}]} = \mathcal{E}|_{[0, t_{j-1}]}$, so it is in $\mathcal{E}$. Hence,
in $\mathcal{E}$, $t_j$ is a good W-resynchronization point with probability
1/2, independently of $\mathcal{E}|_{[0, t_{j-1}]}$. Since $\mathcal{E}'$ was chosen in an
adversarial manner, this completes the induction step.

In summary, we showed that for any node in $W$ and any execution (in which we do not manipulate the times when $R_3$ expires at the respective node), starting from the second time during $[t, t + (k+1)R_3]$ when $R_3$ expires at the respective node, there is a probability of at least 1/2 that the respective time is a good $W$-resynchronization point. Since we assumed that $|W| = n - f$ and there are at least $k$ such times for each node in $W$, this implies that having no good $W$-resynchronization point during $[t, t + (k+1)R_3]$ is at least as unlikely as $k(n-f)$ unbiased and independent coin flips all showing tail, i.e., $(1/2)^{k(n-f)}$. This concludes the proof. ■
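To get a feeling for how quickly the coin-flip bound $(1/2)^{k(n-f)}$ decays, the following minimal Python sketch evaluates it exactly and searches for the smallest $k$ that pushes it below a target. The function names and the target value are illustrative assumptions and do not appear in the paper.

```python
from fractions import Fraction


def no_good_point_bound(k: int, n: int, f: int) -> Fraction:
    """Bound (1/2)^(k(n-f)) on the probability that no good
    W-resynchronization point occurs, assuming |W| = n - f."""
    if k < 0 or f < 0 or n <= f:
        raise ValueError("need k >= 0 and n > f >= 0")
    return Fraction(1, 2 ** (k * (n - f)))


def rounds_for_target(n: int, f: int, eps: Fraction) -> int:
    """Smallest k >= 1 such that the bound is at most eps (hypothetical helper)."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    k = 1
    while no_good_point_bound(k, n, f) > eps:
        k += 1
    return k


if __name__ == "__main__":
    # n = 4 nodes, f = 1 fault: each additional k contributes 3 coin flips.
    print(no_good_point_bound(1, 4, 1))
    print(rounds_for_target(4, 1, Fraction(1, 1000)))
```

Because the exponent grows with both $k$ and $n - f$, larger systems need fewer rounds to reach the same target in this bound.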

C. Stabilization via Good Resynchronization Points 

Having established that eventually a good $W$-resynchronization
point $t_g$ will occur, we turn to proving the
convergence of the main routine. We start with a few helper 
statements wrapping up that a good resynchronization point 
guarantees proper reset of flags and timeouts involved in the 
stabilization process of the main routine. 

Lemma 4.11: Suppose $t_g$ is a good $W$-resynchronization point. Then

(i) each node $i \in W$ switches to passive at a time $t_i \in (t_g + 4d, t_g + (4\vartheta + 3)d)$ and observes itself in state dormant during $[t_g + 4d, \tau_{i,i}(t_i))$,

(ii) $\mathrm{Mem}_{i,j,\mathrm{join}}|_{[\tau_{i,i}(t_i),\, t_{join}]} \equiv 0$ for all $i,j \in W$, where $t_{join} \ge t_g + 4d$ is the infimum of all times greater than $t_g - T_1 - d$ when a node from $W$ switches to join,

(iii) $\mathrm{Mem}_{i,j,\mathrm{sleep} \to \mathrm{waking}}|_{[\tau_{i,i}(t_i),\, t_s]} \equiv 0$ for all $i,j \in W$, where $t_s \ge t_g + (1 + 1/\vartheta)T_1$ is the infimum of all times greater or equal to $t_g$ when a node from $W$ switches to sleep → waking,

(iv) no node from $W$ resets its sleep → waking flags during $[t_g + (1 + 1/\vartheta)T_1, t_g + R_1/\vartheta]$, and

(v) no node from $W$ resets its join flags due to switching to passive during $[t_g + (1 + 1/\vartheta)T_1, t_g + R_1/\vartheta]$.

Proof: All nodes in $W$ switch to state supp → resync during $(t_g, t_g + 2d)$ and switch to state resync when their timeout of $\vartheta 4d$ expires, which does not happen until time $t_g + 4d$. Once this timeout expired, they switch to state passive as soon as they observe themselves in state resync, i.e., by time $t_g + (4\vartheta + 3)d$. Hence, every node $i \in W$ does not observe itself in state resync within $[t_g + 3d, \tau_{i,i}(t_i))$, and therefore is in state dormant during $[t_g + 3d, t_i]$. This implies that it observes itself in state dormant during $[t_g + 4d, \tau_{i,i}(t_i))$, completing the proof of Statement (i).

Moreover, from the definition of a good $W$-resynchronization point we have that no nodes from $W$ are in state join at times in $[t_g - T_1 - d, t_{join})$. Statement (ii) follows, as every node from $W$ resets its join flags upon switching to state passive at time $t_i$.

Regarding Statement (iii), observe first that no nodes from $W$ are in state sleep → waking during $(t_g - d, t_g + (1 + 1/\vartheta)T_1)$ for the following reason: By definition of a good $W$-resynchronization point no node from $W$ switches to sleep during $(t_g - \Delta_g, t_g) \supseteq (t_g - (2\vartheta + 1)T_1 - 3d, t_g)$. Any node in $W$ that is in states sleep or sleep → waking at time $t_g - (2\vartheta + 1)T_1 - 3d$ switches to state waking before time $t_g - d$ due to timeouts. Finally, any node in $W$ switching to sleep at or after time $t_g$ will not switch to state sleep → waking before time $t_g + (1 + 1/\vartheta)T_1$. The observation follows.

Since nodes in $W$ reset their sleep → waking flags at some time from

$$[t_i, \tau_{i,i}(t_i)] \subset (t_g + 3d, t_g + (4\vartheta + 4)d) \subseteq (t_g + 3d, t_g + (1 + 1/\vartheta)T_1),$$

Statement (iii) follows.

Statements (iv) and (v) follow from the fact that all nodes in $W$ switch to state passive by time

$$t_g + (3 + 4\vartheta)d \le t_g + \left(1 + \frac{1}{\vartheta}\right)T_1 - d,$$

while timeout ($R_1$, supp → resync) must expire first in order to switch to dormant and subsequently passive again. ■



Before we proceed with our third main statement showing 
eventual stabilization, we make a few more basic observa- 
tions. Firstly, if nodes do not make progress on the basic 
cycle, they must eventually switch to recover, i.e., the 
timeout conditions ensure detection of deadlocks. 

Lemma 4.12: For any time $t^-$ and any node it holds that it must be in state recover or join or switch to sleep at some time from $[t^-, t^- + (1 - 1/\vartheta)T_1 + T_2 + T_4 + T_5 + 4d)$.

Proof: Suppose a node is never in state recover or join during $[t^-, t^- + (1 - 1/\vartheta)T_1 + T_2 + T_4 + T_5 + 4d)$. Thus it may follow transitions along the basic cycle only. Assume w.l.o.g. that the node switched to sleep right before time $t^-$. Thus, it switched to state accept beforehand, no later than time $t^- - T_1/\vartheta$. Due to timeouts, it will either switch to recover at some point in time or switch to sleep, sleep → waking, waking, ready, propose, accept, and finally sleep again. At each state, it takes less than $d$ time until a respective timeout is started and it observes itself in the respective state. Hence, the node switches to recover or sleep before time

$$t^- - \frac{T_1}{\vartheta} + \max\{(2\vartheta + 2)T_1 + 3d,\; T_2\} + T_4 + T_5 + T_1 + 4d \le t^- + \left(1 - \frac{1}{\vartheta}\right)T_1 + T_2 + T_4 + T_5 + 4d,$$

proving the claim of the lemma. ■
Secondly, after a good $W$-resynchronization point $t_g$, no node from $W$ will switch to state join until either time $t_g + T_7/\vartheta + 4d$ or $T_6/\vartheta$ time after the first non-faulty node switched to sleep → waking again after $t_g$. By proper choice of $T_7 > T_6$ and $T_6$, this will guarantee that nodes from $W$ do not switch to join prematurely during the final steps of the stabilization process.

Lemma 4.13: Suppose $t_g$ is a good $W$-resynchronization point. Denote by $t_s$ the infimum of times greater than $t_g$ when a node in $W$ switches to state sleep → waking and by $t_{join}$ the infimum of times greater than $t_g - T_1 - d$ when a node in $W$ switches to state join. Then, starting from time $t_g + 4d$, tr(recover, join) is not satisfied at any node in $W$ until time

$$\min\left\{t_g + \frac{T_7}{\vartheta} + 4d,\; t_s + \frac{T_6}{\vartheta}\right\} \le t_{join}.$$

Proof: By Statements (ii) and (iii) of Lemma 4. 1 1 1 and 
Inequality ^ we have that t^ > tg+Ti+Ad > tg+{A{}+A)d 
Ad. Consider a node i E W not observing 

Ad, tyom] ■ 



and tjoin > tg 
itself in state dormant at some time t e [tg 



According to Statements (i) and (ii) of Lemma 4.11 the threshold condition of $f + 1$ nodes memorized in state join cannot be satisfied at such a node. By Statements (i) and (iii) of the lemma, the threshold condition of $f + 1$ nodes memorized in state sleep → waking cannot be satisfied unless $t > t_s$. Hence, if at time $t$ a node from $W$ satisfies that it observes itself in state active, we have that $T_6$ expired after being reset after time $t_s$, i.e., $t > t_s + T_6/\vartheta$. Moreover, by Statement (i) of Lemma 4.11 we have that if $T_7$ is expired at any node in $W$ at time $t$, it holds that $t > t_g + T_7/\vartheta + 4d$. Altogether, we conclude that tr(recover, join) is not satisfied at any node in $W$ during

$$\left[t_g + 4d,\; \min\left\{t_g + \frac{T_7}{\vartheta} + 4d,\; t_s + \frac{T_6}{\vartheta}\right\}\right].$$

In particular, $t_{join}$ must be larger than the upper boundary of this interval, concluding the proof. ■

Thirdly, after a good $W$-resynchronization point, any node in $W$ switches to recover or to sleep → waking within bounded time, and all nodes in $W$ doing the latter will do so in rough synchrony. Using the previous lemmas, we can show that this happens before the transition to join is enabled for any node.

Lemma 4.14: Suppose $t_g$ is a good $W$-resynchronization point and use the notation of Lemma 4.13. Define $t^+ := t_g - T_1/\vartheta + T_2 + T_4 + T_5 + 3d$ and denote by $t_{sleep}$ the infimum of all times greater than $t_g - \Delta_g$ when a node in $W$ switches to sleep. Then $t_{sleep} \ge t_g$ and either

(i) $t_{sleep} < t^+$ and at time $t_{sleep} + 2T_1 + 4d$, any node in $W$ is either in one of the states sleep or sleep → waking and observed in sleep or is in recover and also observed in recover, or

(ii) all nodes in $W$ are observed in state recover at time $t^+ + 2T_1 + 4d$.

Proof: By definition of a good resynchronization point, no node switches to sleep during $(t_g - \Delta_g, t_g)$, giving that $t_{sleep} \ge t_g$. If $t_{sleep} < t^+$, Lemma 4.13 yields that

$$t_{join} \ge \min\left\{t_g + \frac{T_7}{\vartheta} + 4d,\; t_{sleep} + \frac{T_6}{\vartheta}\right\} \ge \min\left\{t^+ + 2T_1 + 4d,\; t_{sleep} + \frac{T_6}{\vartheta}\right\} \ge t_{sleep} + 2T_1 + 4d.$$

Therefore, by definition of a good resynchronization point, no nodes are in state join during $(t_g - T_1 - d, t_{join}) \supseteq (t_{sleep} - T_1 - d, t_{sleep} + T_1 + 4d)$. Recalling that during $(t_g - \Delta_g, t_{sleep})$, no node is in state sleep, the preconditions of Corollary 4.7 hold, implying Statement (i).

If $t_{sleep} \ge t^+$, by definition of a good resynchronization point no node switched to sleep during $(t_g - \Delta_g, t^+) \supseteq (t_g - T_1 - d, t^+)$ and no node is in state join during $(t_g - T_1 - d, t_{join})$. By Lemma 4.13,

$$t_{join} \ge \min\left\{t_g + \frac{T_7}{\vartheta} + 4d,\; t_{sleep} + \frac{T_6}{\vartheta}\right\} \ge \min\left\{t^+ + 2T_1 + 4d,\; t^+ + \frac{T_6}{\vartheta}\right\} \ge t^+ + 2T_1 + 4d.$$

Hence, Lemma 4.12 states that every node must be in state recover at some time in $(t_g - T_1 - d, t^+)$. Since nodes do not leave state recover during $(t_g - T_1 - d, t_{join})$, Statement (ii) follows. ■

We have everything in place for proving that a good resynchronization point leads to stabilization within $R_1/\vartheta - 3d$ time.

Theorem 4.15: Suppose $t_g$ is a good $W$-resynchronization point. Then there is a quasi-stabilization point during $(t_g, t_g + R_1/\vartheta - 3d]$.

Proof: For simplicity, assume during this proof that $R_1 = \infty$, i.e., by Statement (i) of Lemma 4.11 all nodes in $W$ observe themselves in states passive or active at times greater or equal to $t_g + (4\vartheta + 4)d$. We will establish the existence of a quasi-stabilization point at a time larger than $t_g$ and show that it is upper bounded by $t_g + R_1/\vartheta - 3d$.

Hence this assumption can be made w.l.o.g., as the existence of the quasi-stabilization point depends on the execution up to time $t_g + R_1/\vartheta$ only, and $R_1$ cannot expire before this time at any node in $W$. Moreover, by Statements (i) and (ii) of Lemma 4.11 every node satisfies $\mathrm{Mem}_{i,i,\mathrm{join}} \equiv 0$ on $[t_g + (4\vartheta + 4)d, t_{i,join})$, where $t_{i,join}$ denotes the infimum of all times greater or equal to $t_g + (4\vartheta + 4)d$ when node $i$ switches to join. During the time span considered in this proof, every node switches at most once to join, thus we may w.l.o.g. assume that $\mathrm{Mem}_{i,i,\mathrm{join}} \equiv 0$ is always satisfied in the following. We use the notation of Lemmas 4.13 and 4.14. By Statements (ii) of Lemma 4.11 and Inequality (2) we have that $t_s \ge t_g + (1 + 1/\vartheta)T_1 \ge t_g + (4\vartheta + 4)d$.

According to Lemma 4.11, all nodes in $W$ switched to state passive during $(t_g + 4d, t_g + (3 + 4\vartheta)d)$, implying that at any node in $W$, $T_7$ will expire at some time from

$$\left(t_g + \frac{T_7}{\vartheta} + 4d,\; t_g + T_7 + (4\vartheta + 4)d\right) \subseteq \left(t_g + \left(1 + \frac{1}{\vartheta}\right)T_1,\; t_g + T_7 + (4\vartheta + 4)d\right).$$

By Lemma 4.13 thus $t_{join} \ge t_g + (1 + 1/\vartheta)T_1$, and by Statement (v) of Lemma 4.11 no node resets its join flags after $t_g + (1 + 1/\vartheta)T_1$ again (before $R_1$ expires).

Case 1: Assume $t_{sleep} \ge t^+$. Thus, Statement (ii) of Lemma 4.14 applies, i.e., all nodes are observed in state recover by time $t^+ + 2T_1 + 4d$. Any node from $W$ will switch to state join before time $t_g + T_7 + (4\vartheta + 4)d$ because $T_7$ expires no later than that. Subsequently, it will switch to propose as soon as it memorizes all non-faulty nodes in state join. Denote by $t_{propose} \in (t_g + T_7/\vartheta + 4d, t_g + T_7 + (4\vartheta + 5)d)$ the minimal time when a node from $W$ switches from join to propose. Certainly, nodes in $W$ do not switch from waking to ready during $(t_{propose}, t_{propose} + 2d)$ and therefore also not reset their join flags before time $t_{propose} + 3d$. As nodes in $W$ reset their propose and accept flags upon switching to state join, some node in $W$ must memorize $n - 2f \ge f + 1$ non-faulty nodes in state join at time $t_{propose}$. According to Statement (ii) of Lemma 4.11, these nodes must have switched to state join at or after time $t_{join}$. Hence, all nodes in $W$ will memorize them in state join by time $t_{propose} + d$ and thus have switched to state join. Hence, all nodes in $W$ will switch to state propose before time $t_{propose} + 2d$ and subsequently to state accept before time $t_{propose} + 3d$, i.e., $t_{propose} \le t_g + T_7 + (4\vartheta + 5)d$ is a quasi-stabilization point.
Case 2: Assume $t_{sleep} < t^+$. By Statement (i) of Lemma 4.14, all nodes are observed in either sleep or recover at time $t_{sleep} + 2T_1 + 4d$. The nodes observed in state sleep will have been observed in state sleep → waking and switched to waking by time $t_{sleep} + (2\vartheta + 3)T_1 + 5d$.

Case 2a: Suppose fewer than $f + 1$ nodes in $W$ are observed in state sleep at time $t_{sleep} + 2T_1 + 4d$, i.e., at least $n - 2f \ge f + 1$ non-faulty nodes are observed in state recover. By Lemma 4.13 we have that



tsieep + {2d + 3)Ti + 7d 

1 



2d + 3 



d 



Ti + 7d 



f minji, + (^2?$l + 3-^^ri + 7rf| 

_ mill <ts + —,tg + — +Ad> < tjoin. 

Hence, any node observing itself in state waking at some time $t \in (t_{sleep} + 2T_1 + 4d, t_{sleep} + (2\vartheta + 3)T_1 + 6d)$ will also observe at least $f + 1$ nodes in state recover and switch to recover. As any node in sleep or sleep → waking at time $t_{sleep} + 2T_1 + 4d$ will observe itself in state waking no later than time $t_{sleep} + (2\vartheta + 3)T_1 + 6d$, by time $t_{sleep} + (2\vartheta + 3)T_1 + 7d < t_{join}$, all nodes observe themselves in state recover. From here we can argue analogously to the first case, i.e., there exists a quasi-stabilization point $t_{propose} \le t_g + T_7 + (4\vartheta + 5)d$.

Case 2b: Suppose at least $f + 1$ nodes in $W$ are observed in state sleep at time $t_{sleep} + 2T_1 + 4d$. These nodes will switch to waking and subsequently ready until time

$$\max\left\{t_{sleep} + (2\vartheta + 3)T_1 + 6d,\; t_{sleep} - \frac{T_1}{\vartheta} + T_2 + d\right\} = t_{sleep} - \frac{T_1}{\vartheta} + T_2 + d \qquad (27)$$



due to $T_2$ being expired while observing themselves in waking unless they switch from waking to recover. Note that these nodes reset their accept flags upon switching to waking. Denote by $t_{propose}$ and $t_{accept}$ the infima of times greater than $t_{sleep} + 2T_1 + 4d$ when a node switches to propose or accept, respectively. Recall that any node switching from recover to join resets its propose and accept flags, and any node switching from waking to ready resets its propose flags. Hence, we have for all $i,j \in W$ that

(i) $\mathrm{Mem}_{i,j,\mathrm{propose}}(t) = 0$ at any time $t \in [t_{sleep} + 2T_1 + 4d, t_{propose}]$ when $i$ observes itself in ready or join, and

(ii) $\mathrm{Mem}_{i,j,\mathrm{accept}}(t) = 0$ at any time $t \in [t_{sleep} + 2T_1 + 4d, t_{accept}]$ when $i$ observes itself in ready, join, or propose.

By Statements (ii) and (iv) of Lemma 4.11 no node from $W$ resets its sleep → waking flags at or after time $t_s \ge t_g + (1 + 1/\vartheta)T_1$. As $t_s \ge t_{sleep} + 2T_1 + 4d$ and all nodes observed in sleep at time $t_{sleep} + 2T_1 + 4d$ will be observed in sleep → waking by time $t_{sleep} + (2\vartheta + 3)T_1 + 5d$, Statement (i) of the lemma implies that all nodes in $W$ switch to active at some time from $(t_s, t_{sleep} + (2\vartheta + 3)T_1 + 5d) \subseteq (t_{sleep} + 2T_1 + 4d, t_{sleep} + (2\vartheta + 3)T_1 + 5d)$. As, by the Statements (i) and (ii) from above, the first node switching to state propose must do so because of an expiring timeout, Lemma 4.13 yields that



propose 



> mill <^tjom, t sleep — Ti — d 

> minjtg + !y +4rf,t 

mill "I t sleep — 
isleep ~ ^ 

^sleep ^1 d 

1 

2- - 



T2 

T2 + minjTs, 
^9 



t 



sleep 



Therefore, 



^propose ^ ^sleep 



2 - 



i9 



i9 - 





T7 
I? 


ri + 


^6 


7^2 + 










^6 



sleep 



By Inequality (27) we conclude that at time $t_{sleep} - T_1/\vartheta + T_2 + 2d < t_{propose}$, any node from $W$ observes itself in one of the states ready, recover, or join.
Again, we distinguish two cases.

Case 2b-I: $t_{propose} < t_{sleep} - T_1 - d + (T_2 + T_3)/\vartheta$. As previously used, no node can switch from ready to propose during $(t_{sleep} + 2T_1 + 4d, t_{sleep} - T_1 - d + (T_2 + T_3)/\vartheta)$. Hence, there must be a node that switches from join to propose at
time tpropose- By Statements (i) and (ii) from above, the node t„ + id. 



T2 + 2d. 
(28) 



must memorize at least $n - 2f \ge f + 1$ nodes from $W$ in state join at time $t_{propose}$. By Statement (ii) of Lemma 4.11 these nodes must have switched to join at or after time $t_{join}$. By Statements (iii) and (v) of the lemma, no node resets its join flags during $[t_g + (1 + 1/\vartheta)T_1, t_g + R_1/\vartheta) \supseteq [t_{propose}, t_{propose} + 3d)$ unless it switches to state join. Hence, all nodes still in state recover have switched to join by time $t_{propose} + d$, giving that all nodes are in one of the states ready, join, or accept at time $t_{propose} + d$ (since they cannot leave accept earlier than $t_{propose} + T_1/\vartheta > t_{propose} + 4d$ again).



Case 2b-II: $t_{propose} \ge t_{sleep} - T_1 - d + (T_2 + T_3)/\vartheta$. Recall that all nodes switched to active by time $t_{sleep} + (2\vartheta + 3)T_1 + 5d$. Hence, any node observing itself in state recover at time $t_{sleep} + 2T_1 + 4d$ will have switched to join because $T_6$ expired by time

$$t_{sleep} + (2\vartheta + 3)T_1 + T_6 + 6d \le t_{sleep} - T_1 - d + \frac{T_2 + T_3}{\vartheta} \le t_{propose}. \qquad (29)$$

Hence, also in this case all nodes are in one of the states ready, join, or accept at time $t_{propose} + d$.

Continuing Case 2b: Next, we claim that any node is in states propose or join by time $t_{sleep} - T_1/\vartheta + T_2 + T_4 + 2d$. To see this, observe that any node following the basic cycle must switch from ready to propose by this time due to timeouts. On the other hand, according to Inequality (29) all nodes in state recover switch to join by time

$$t_{sleep} - T_1 - d + \frac{T_2 + T_3}{\vartheta} \le t_{sleep} - \frac{T_1}{\vartheta} + T_2 + T_4 + 2d,$$

showing the claim.

In summary, we showed the following points:

(i) At time $t_{propose} + d$, all nodes are observed in states ready, join, propose, or accept.

(ii) All nodes switch to states propose or join during $[\min\{t_{join}, t_{propose}\}, t_{sleep} - T_1/\vartheta + T_2 + T_4 + 2d)$.

(iii) No node resets its propose or accept flags at or after time $t_{propose} + d$ unless switching to accept first.

(iv) No node memorizes nodes in state propose or accept that have not been in that state at or after time $t_{propose}$.

We claim that the infimum $t_q$ of all times from

$$\left[t_{propose},\; t_{sleep} - \frac{T_1}{\vartheta} + T_2 + T_4 + 2d\right]$$

when a node switches to accept is a quasi-stabilization point.
Note that because 



t 



sleep 



sleep 



propose 



Ti+Ad 



T2 + r4 
T5- 



-4rf 



no node will switch from propose to recover before time 



Again, we distinguish two cases. First assume that $t_q < t_{sleep} - T_1/\vartheta + T_2 + T_4 + 2d$, i.e., at time $t_q$ indeed a node switches to state accept. Due to Statement (iv) from the above list and the minimality of $t_q$, it follows that the respective node memorizes $n - 2f \ge f + 1$ nodes from $W$ in state propose that switched to propose at or after time $t_{propose}$. These nodes must be in one of the states propose or accept during $[t_q, t_q + 3d]$. According to Statement (i) from above, thus all nodes still in ready will switch to propose by time $t_q + d$. By time $t_q + 2d$, all nodes in join will observe the at least $n - f$ nodes from $W$ in one of the states join, propose, or accept, and hence switch to propose. Another $d$ time later, all nodes will have switched to accept, i.e., $t_q$ is indeed a quasi-stabilization point.



On the other hand, if $t_q = t_{sleep} - T_1/\vartheta + T_2 + T_4 + 2d$, Statement (ii) from the above list gives that all nodes from $W$ are in one of the states join, propose, or accept during $[t_q + d, t_q + 3d]$. Therefore, nodes will switch from join to propose and subsequently from propose to accept until time $t_q + 3d$ as well.



It remains to check that in all cases, the obtained quasi-stabilization point $t_q$ occurs no later than time $t_g + R_1/\vartheta - 3d$. In Cases 1 and 2a, we have that

$$t_q = t_{propose} \le t_g + T_7 + (4\vartheta + 5)d \le t_g + \frac{R_1}{\vartheta} - 3d.$$

In Case 2b, it holds that

$$t_q \le t_{sleep} - \frac{T_1}{\vartheta} + T_2 + T_4 + 2d < t^+ - \frac{T_1}{\vartheta} + T_2 + T_4 + 2d = t_g - \frac{2T_1}{\vartheta} + 2T_2 + 2T_4 + T_5 + 5d \le t_g + \frac{R_1}{\vartheta} - 3d.$$

We conclude that indeed all nodes in $W$ switch to accept within a window of less than $3d$ time before, at any node in $W$, $R_1$ expires and it leaves state resync, concluding the proof. ■

Finally, putting together the established main theorems and Lemma 3.4 we deduce that the system will stabilize from an arbitrary initial state provided that a subset of $n - f$ nodes remains coherent for a sufficiently large period of time.

Corollary 4.16: Let $W \subseteq V$, where $|W| \ge n - f$, and define for any $k \in \mathbb{N}$

$$T(k) := (k + 2)(\vartheta(R_2 + 3d) + 8(1 - \lambda)R_2 + d) + R_1/\vartheta.$$

Then, for any $k \in \mathbb{N}$, the proposed algorithm is a $(W, W^2)$-stabilizing pulse synchronization protocol with skew $2d$ and accuracy bounds $(T_2 + T_3)/\vartheta - 2d$ and $T_2 + T_4 + 7d$ stabilizing within time $T(k)$ with probability at least $1 - 2^{-k(n-f)}$. It is feasible to pick timeouts such that $T(k) \in O(kn)$ and $T_2 + T_4 + 7d \in O(1)$.
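The stabilization-time formula of Corollary 4.16 can be evaluated directly. The sketch below is only a calculator for $T(k)$ and the accompanying probability bound; the function names are hypothetical, and the numeric values in the usage example are placeholders that are not claimed to satisfy the timeout constraints of Condition 3.3.

```python
def resync_window(theta: float, d: float, lam: float, R2: float) -> float:
    """The recurring term theta*(R2 + 3d) + 8*(1 - lam)*R2 + d of Corollary 4.16."""
    return theta * (R2 + 3 * d) + 8 * (1 - lam) * R2 + d


def stabilization_time(k: int, theta: float, d: float, lam: float,
                       R1: float, R2: float) -> float:
    """T(k) := (k + 2) * window + R1 / theta."""
    return (k + 2) * resync_window(theta, d, lam, R2) + R1 / theta


def success_probability(k: int, n: int, f: int) -> float:
    """Lower bound 1 - 2^(-k(n-f)) on stabilizing within T(k)."""
    return 1.0 - 2.0 ** (-k * (n - f))


if __name__ == "__main__":
    # Placeholder parameters, chosen only so the arithmetic is easy to follow.
    print(stabilization_time(k=1, theta=2.0, d=1.0, lam=0.5, R1=100.0, R2=10.0))
    print(success_probability(k=1, n=4, f=1))
```

Note that $T(k)$ grows linearly in $k$ while the failure probability shrinks exponentially in $k(n - f)$, which is the trade-off the corollary expresses.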



Proof: The satisfiability of Condition 3.3 with $T(k) \in O(kn)$ and $T_2 + T_4 + 7d \in O(1)$ follows from Lemma 3.4. Assume that $t^+$ is sufficiently large for $[t^- + T(k) + 2d, t^+]$ to be non-empty, as otherwise nothing is to show. By definition, $W$ will be coherent during $[t_c^-, t^+]$, with $t_c^- = t^- + \vartheta(R_2 + 3d) + 8(1 - \lambda)R_2 + d$. According to Theorem 4.10, there will be some good $W$-resynchronization point $t_g \in [t_c^-, t_c^- + (k + 1)(\vartheta(R_2 + 3d) + 8(1 - \lambda)R_2 + d)]$ with probability at least $1 - 1/2^{k(n-f)}$. If this is the case, Theorem 4.15 shows that there is a $W$-stabilization point $t \in [t_g, t^- + T(k)]$. Applying Theorem 4.4 inductively, we derive that the algorithm is a $(W, W^2)$-stabilizing pulse synchronization protocol with the bounds as stated in the corollary that stabilizes within time $T(k)$ with probability at least $1 - 1/2^{k(n-f)}$. ■

D. Late Joining and Fast Recovery

An important aspect of combining self-stabilization with Byzantine fault-tolerance is that the system can remain operational when facing a limited number of transient faults. If the affected components stabilize quickly enough, this can prevent future faults from causing system failure. In an environment where transient faults occur according to a random distribution that is not too far from being uniform (i.e., one deals not primarily with bursts), the mean time until failure is therefore determined by the time it takes to recover from transient faults. Thus, it is of significant interest that a node that starts functioning according to the specifications again synchronizes as fast as possible to an existing subset of correct nodes. Moreover, it is of interest that a node that has been shut down temporarily, e.g. for maintenance, can join the operational system again quickly.

Theorem 4.17: Suppose there exists a node $i \in V$ and a set $W \subseteq V$, $|W| \ge n - f$, such that there is a $W$-stabilization point at some time $t^-$ and $W \cup \{i\}$ is coherent during $[t^-, t^- + (1 + 5/(2\vartheta))R_1]$. Then there is a $(W \cup \{i\})$-stabilization point at some time $t \in [t^-, t^- + (1 + 5/(2\vartheta))R_1]$.

Proof: Again, the proof is executed by distinguishing cases. W.l.o.g., we assume for the moment that $W \cup \{i\}$ is coherent during $[t^-, \infty)$ and later show that indeed $t \le t^- + (1 + 5/(2\vartheta))R_1$.

Case 1: Node $i$ does not switch to supp → resync during $[t^-, t^- + R_1 + (\vartheta - 1)T_1 + 2T_2 + 2T_4 + 18d]$. Thus, after $R_1$ expires at the latest by time $t^- + R_1 + d$, it will observe itself in dormant during $[t^- + R_1 + 2d, t^- + R_1 + (\vartheta - 1)T_1 + 2T_2 + 2T_4 + 18d]$ and therefore not be (or observe itself) in state join during $[t^- + R_1 + 3d, t^- + R_1 + (\vartheta - 1)T_1 + 2T_2 + 2T_4 + 18d]$. By Theorem 4.4 there is a $W$-stabilization point $t_W \in [t^- + R_1 + (\vartheta - 1)T_1 + 4d, t^- + R_1 + (\vartheta - 1)T_1 + T_2 + T_4 + 9d)$. Subsequently, the nodes in $W$ will switch to sleep during $[t_W + T_1/\vartheta, t_W + T_1 + 5d]$. Denote by $t_{sleep}$ the minimum of the respective times. We apply Lemma 4.5 to $W \cup \{i\}$. Thus, at time $t_{sleep} + 2T_1 + 3d$, node $i$ is either in state recover and will not leave until the next $W$-stabilization point (or it switches to join), or it is in state sleep and reset its timeout $T_2$ at some time from $[t_W - \Delta_g + T_1/\vartheta - 4d, t_W + (3 - 1/\vartheta)T_1 + 8d]$.

Case 1a: Node $i$ is in recover at time $t_{sleep} + 2T_1 + 3d$. As it cannot switch to join until time $t^- + R_1 + (\vartheta - 1)T_1 + 2T_2 + 2T_4 + 18d$, it will stay in recover until the subsequent $W$-stabilization point $t'_W \in (t_W + (T_2 + T_3)/\vartheta, t^- + R_1 + (\vartheta - 1)T_1 + 2T_2 + 2T_4 + 14d)$ (existing according to Theorem 4.4). By time $t'_W$, clearly timeout (recover, $\vartheta(2T_1 + 3d)$) has expired at the node, as

$$t'_W > t_W + \frac{T_2 + T_3}{\vartheta} \ge t_{sleep} + (\vartheta + 1)(2T_1 + 3d).$$

Because $T_1/\vartheta \ge 4d$, $i$ will observe all nodes from $W$ in accept during $[t'_W + 3d, t'_W + 4d]$. Hence it will switch to accept by time $t'_W + 4d$, i.e., $t'_W$ is a $W \cup \{i\}$-quasi-stabilization point.

Case 1b: Node $i$ is in sleep at time $t_{sleep} + 2T_1 + 3d$. Denote by $t'_W$ the $W$-stabilization point subsequent to $t_W$ as in the previous case. As no node from $W$ is observed in
state accept or recover during [t^ + 2Ti +3(i, t'y^) and i reset 
its timeout T2 no earlier than time ty/ — Ag +Ti/d — Ad, it 
will not switch to recover before time min{t'yj,., t\Y — Ag + 
(Ti + r2 + T3 + T^)/'^ - 4d} unless it switches to accept 
first. However, as it resets its propose and accept flags before 
switching to ready, it cannot switch to accept before at least 
/ nodes from W switched to propose (unless switching to 
recover first). Moreover, by time t'-^r, it will already have 
switched to ready since 



— ^vv + 



> u 



T2 



8d. 



Hence, reasoning analogously to the proof of Theorem 4.4, $t'_W$ is in fact a $W \cup \{i\}$-stabilization point provided that $i$ switches to accept instead of recover first. This in turn follows from the bound



tw — A„ 



T1+T2+T3 + n 



Ad 



tw - Ag + Ti+T2+T4 + 1 - Ad 

tw-Ag + {2d + b)Ti + T2 + T4 + 3d 

tw+T2 + Ti + Id 
> + 2d, 

where in the last step we used that $t'_W < t_W + T_2 + T_4 + 5d$ according to Theorem 4.4. This shows that $T_5$ does not expire at $i$ while it is in propose before time $t'_W + 2d$. Hence, $t'_W$ is a $W \cup \{i\}$-stabilization point.

Case 2: Node $i$ switches to supp → resync at a time $t' \in [t^-, t^- + R_1 + (\vartheta - 1)T_1 + 2T_2 + 2T_4 + 18d]$. Denote by $t_W$ and $t'_W$ the maximal $W$-stabilization point smaller than $t'$ and the minimal $W$-stabilization point larger than $\max\{t', t_W + 2d\}$, which exist by Theorem 4.4. Denote by $t_{sleep}$ the minimal time larger than $t_W$ when a node from $W$ switches to sleep. Analogously to Case 1b, $t'_W$ is a $W \cup \{i\}$-stabilization point if $i$ is in state sleep at time $t_{sleep} + 2T_1 + 3d$. Hence, assume w.l.o.g. that $i$ is in state recover or already switched to join by this time. Analogously to Case 1a, $t'_W$ will be a $W \cup \{i\}$-quasi-stabilization point if it stays in recover until time $t'_W + 3d$. Therefore, w.l.o.g., $i$ switches to join at some time during $(t', t'_W + 3d)$, implying that it will leave the state no later than time $t'_W + 4d$ and switch to state accept by time $t'_W + 5d$.

Note that we can apply Lemma 4.5 even if $i$ switches to join, as we can simply replace the set $A$ by $W$.



Now either $i$ continues to execute the basic cycle and thus will, analogously to Case 1b, participate in the minimal $W$-stabilization point $t''_W > t'_W + 2d$, or it will switch to recover again. In the latter case, it cannot switch back to join until at least time $t' + R_1/\vartheta$ because it needs to reset its join flags first, which happens upon switching to passive only. As we have that

H (2} 2T 

t' + ^ > t' - -r^ + 2r2 + 2T4 + Ts + Id 
V V 

, 2Ti 

V 



t' + 2T2 + 2Ta + 14d 



> t'{^ + 4d, 



$i$ cannot leave state recover through join again before time $t''_W + 4d$. Therefore, $t''_W$ is a $W \cup \{i\}$-quasi-stabilization point, analogously to Case 1a.

We have shown that there is some $W \cup \{i\}$-quasi-stabilization point at the latest by time

$$t''_W \le t' + 2T_2 + 2T_4 + 10d \le t^- + R_1 + (\vartheta - 1)T_1 + 4T_2 + 4T_4 + 28d$$

in Case 2, while in Case 1 there is a quasi-stabilization point no later than time $t^- + R_1 + (\vartheta - 1)T_1 + 2T_2 + 2T_4 + 18d$. By Theorem 4.4 this implies a $W \cup \{i\}$-stabilization point by time

$$t^- + R_1 + (\vartheta - 1)T_1 + 5T_2 + 5T_4 + 23d \le t^- + \left(1 + \frac{5}{2\vartheta}\right)R_1,$$

where the estimate is obtained analogously to the bound $t' + R_1/\vartheta > t''_W + 4d$ shown above. This concludes the proof, as indeed there is a $W \cup \{i\}$-stabilization point no later than time $t^- + (1 + 5/(2\vartheta))R_1$. ■



V. Generalizations 

This section provides a few extensions of the core results derived in the previous section. In particular, we show that it is not necessary to map faulty channels to, for example, faulty nodes (thus rendering a non-faulty node effectively faulty in terms of results), that the algorithm can tolerate an even stronger adversary than defined in Section II without significant change of stabilization time, and that in many reasonable settings stabilization takes $O(R_1)$ time only, even if there is no majority of non-faulty nodes that is already synchronized. With the exception of Corollary 5.6, we again follow [30] during this section.



A. Synchronization Despite Faulty Channels

Theorem 4.15 and our notion of coherency require that all involved nodes are connected by correct channels only. However, it is desirable that non-faulty nodes synchronize even if they are not connected by correct channels. To capture this, the notions of coherency and stability can be generalized as follows.

Definition 5.1 (Weak Coherency): We call the set $C \subseteq V$ weakly coherent during $[t^-, t^+]$, iff for any node $i \in C$ there is a subset $C' \subseteq C$ that contains $i$, has size $n - f$, and is coherent during $[t^-, t^+]$.

In particular, if there are in total at most $f$ nodes that are faulty or have faulty outgoing channels, then the set of non-faulty nodes is (after some amount of time) weakly coherent.
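The covering structure behind Definition 5.1 can be illustrated with a small brute-force check. The sketch below is a simplification made for illustration: it replaces the paper's time-dependent notion of coherency by the purely structural requirement that a subset contains correct channels only, and it models channels as unordered node pairs. Both the function names and this structural stand-in are assumptions, not part of the paper.

```python
from itertools import combinations


def correct_channels_only(subset, faulty_channels) -> bool:
    """True if no pair of nodes in `subset` is joined by a faulty channel."""
    return all(frozenset(pair) not in faulty_channels
               for pair in combinations(subset, 2))


def structurally_weakly_coherent(C, n: int, f: int, faulty_channels) -> bool:
    """Every node of C lies in some (n - f)-subset of C with correct channels only.

    Brute force over all subsets, so only suitable for small examples.
    """
    faulty = {frozenset(ch) for ch in faulty_channels}
    good = [set(s) for s in combinations(sorted(C), n - f)
            if correct_channels_only(s, faulty)]
    return all(any(i in s for s in good) for i in C)


if __name__ == "__main__":
    nodes = {0, 1, 2, 3}  # n = 4, f = 1
    # One faulty channel: {0, 2, 3} and {1, 2, 3} together cover all nodes.
    print(structurally_weakly_coherent(nodes, 4, 1, [(0, 1)]))
```

In the usage example no single subset of size $n - f$ with correct channels contains both node 0 and node 1, yet each node is covered by one such subset, which is exactly the situation the weaker notion is meant to capture.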

Corollary 5.2: For each $k \in \mathbb{N}$ let $T'(k) := T(k) - (\vartheta(R_2 + 3d) + 8(1 - \lambda)R_2 + d)$, where $T(k)$ is defined as in Corollary 4.16. Suppose the subset of nodes $C \subseteq V$ is weakly coherent during the time interval $[t^-, t^+] \supseteq [t^- + T'(k) + T_2 + T_4 + 8d, t^+] \ne \emptyset$. Then, with probability at least $1 - (f + 1)/2^{k(n-f)}$, there is a $C$-quasi-stabilization point $t \le t^- + T'(k) + T_2 + T_4 + 5d$ such that the system is weakly $C$-coherent during $[t, t^+]$.

Proof: By the definition of weak coherency, every node in $C$ is in some coherent set $C' \subseteq C$ of size $n - f$. Hence, for any such $C'$ it holds that we can cover all nodes in $C$ by at most $1 + |V \setminus C'| \le f + 1$ coherent sets $C_1, \ldots, C_{f+1} \subseteq C$. By Corollary 4.16 and the union bound, with probability at least $1 - (f + 1)/2^{k(n-f)}$, for each of these sets there will be at least one stabilization point during $[t^-, t^- + T'(k) - (T_2 + T_4 + 5d)]$. Assuming that this is indeed true, denote by $t_{i_0} \in [t^-, t^- + T'(k) - (T_2 + T_4 + 5d)]$ the time

$$\max_{i \in \{1, \ldots, f+1\}} \{\max\{t \le t^- + T'(k) - (T_2 + T_4 + 5d) \mid t \text{ is a } C_i\text{-stabilization point}\}\},$$

where $i_0 \in \{1, \ldots, f + 1\}$ is an index for which the first maximum is attained and $t_{i_0}$ is the respective maximal time, i.e., $t_{i_0}$ is a $C_{i_0}$-stabilization point.

Define $t'_{i_0} \in (t_{i_0} + 2d, t^- + T'(k)]$ to be minimal such that it is another $C_{i_0}$-stabilization point. Such a time must exist by Theorem 4.4. Since the theorem also states that no node from $C_{i_0}$ switches to state accept during $[t_{i_0} + 2d, t'_{i_0})$ and $C_i \cap C_{i_0} \ne \emptyset$, there can be no $C_i$-stabilization point during $(t_{i_0} + 2d, t'_{i_0} - 2d)$ for any $i \in \{1, \ldots, f + 1\}$. Applying the theorem once more, we see that there are also no $C_i$-stabilization points during $(t'_{i_0} + 2d, t'_{i_0} + (T_2 + T_3)/\vartheta - 2d)$ for any $i \in \{1, \ldots, f + 1\}$. On the other hand, the maximality of $t_{i_0}$ implies that every $C_i$ had a stabilization point by time $t_{i_0}$. Applying Theorem 4.4 to the latest stabilization point until time $t_{i_0}$ for each $C_i$, we see that it must have another stabilization point before time $t_{i_0} + T_2 + T_4 + 5d$. We have



2d 



5d, 



i9 ?9 
i.e., all $C_i$ have stabilization points within a short time interval of $(t'_{i_0} - 2d, t'_{i_0} + 2d)$. Arguing analogously about the previous stabilization points of the sets $C_i$ (which exist because $t_{i_0}$ is maximal), we infer that all $C_i$ had their previous stabilization point during $(t_{i_0} - 2d, t_{i_0} + 2d)$.

Now suppose $t_a$ is the minimal time in $(t'_{i_0} - 2d, t'_{i_0} + 2d)$ when a node from $C$ switches to accept and this node is in set $C_i$ for some $i \in \{1, \ldots, f + 1\}$. As usual, there must be at least $f + 1$ non-faulty nodes from $C_i$ in state propose at time $t_a$ and by time $t_a + d$, all nodes from $C_i$ will be observed in either of the states propose or accept. As $|C_i \cap C_j| \ge f + 1$ for any $j \in \{1, \ldots, f + 1\}$, all nodes in $C_j$ will observe at least $f + 1$ nodes in states propose or accept at times in $(t_a, t_a + 2d)$. We have that $t_a \ge t_{i_0} + (T_2 + T_3)/\vartheta - 2d$ according to Theorem 4.4. As no nodes switched to state accept during $(t_{i_0} + 2d, t_a)$ and none of them switch to state recover (cf. Theorem 4.4), for any $j$ we can bound
, T2+T3 



ta + d > ti 



> t.i - 3d + 



T2 + n 



t, + + 3d 



that all nodes from Cj are in one of the states ready, propose, 
or accept at time ta + d. Hence, they will switch from ready 
to propose if they still are in ready before time ta + 2d. Less 
than d time later, all nodes in Cj will memorize all nodes 
in Cj in state propose and therefore switch to accept if not 
done so yet. Since j was arbitrary, it follows that ta is a 
C-quasi-stabilization point. ■ 
Corollary 5.3: Suppose $C \subseteq V$ is weakly coherent during $[t^-, t^+]$ and $t \in [t^-, t^+ - (T_2 + T_4 + 8d)]$ is a $C$-quasi-stabilization point. Then

(i) all nodes from $C$ switch to accept exactly once within $[t, t + 3d)$;

(ii) there will be a $C$-quasi-stabilization point $t' \in [t + (T_2 + T_3)/\vartheta, t + T_2 + T_4 + 5d)$ satisfying that no nodes switch to accept in the time interval $[t + 3d, t')$;

(iii) and the main state machine (Figure 1) of each node $i \in C$ is metastability-free during $[t + 4d, t' + 4d)$.



Proof: Analogously to the proofs of Theorem 4.4 and Corollary 5.2.



We point out that one cannot get stronger results by the proposed technique. Even if there are merely $f + 1$ failing channels, this can, for example, effectively render a node faulty (as it may never see $n - f$ nodes in states propose or accept) or exclude the existence of a coherent set of size $n - f$ (if the channels connect $f + 1$ disjoint pairs of nodes, there can be no subset of $n - f$ nodes whose induced subgraph contains correct channels only). Stronger resilience to channel faults would require propagating information over several hops in a fault-tolerant manner, imposing larger bounds on timeouts and weaker synchronization guarantees.



Combining Corollary 5.2 and Corollary 5.3 finally yields:

Corollary 5.4: Let $C \subseteq V$ be such that, for each $i \in C$, there is a set $C_i \subseteq C$ with $|C_i| = n - f$, and let $E = \bigcup_{i \in C} C_i^2$. Then, for any $k \in \mathbb{N}$, the proposed algorithm is a $(C, E)$-stabilizing pulse synchronization protocol with skew $3d$ and accuracy bounds $(T_2 + T_3)/\vartheta - 3d$ and $T_2 + T_4 + 8d$, stabilizing within time $T(k) + T_2 + T_4 + 5d$ with probability at least $1 - (f + 1)/2^{k(n-f)}$.



Proof: Analogously to the proof of Corollary 4.16.



B. Stronger Adversary 

So far, our analysis considered a fixed set $C$ of coherent (or weakly coherent) nodes. But what happens if it is not determined upfront whether a node becomes faulty, and this instead depends on the execution? Phrased differently, does the algorithm still stabilize quickly with a large probability if an adversary may "corrupt" up to $f$ nodes, but may decide on its choices as time progresses, fully aware of what happened so far? Since we operate in a system where all operations take positive time, a node might even fail just when it is about to perform a certain state transition, and would not have done so if the execution had proceeded differently. Due to the way we use randomization, however, this makes little difference for the stabilization properties of the algorithm.

Corollary 5.5: Suppose at every time $t$, an adversary has full knowledge of the state of the system up to and including time $t$, and it may decide on up to $f$ nodes in total (or all channels originating from a node) becoming faulty at arbitrary times. If it picks a node at time $t$, it fully controls its actions after and including time $t$. Furthermore, it controls delays and clock drifts of non-faulty components within the system specifications, and it initializes the system in an arbitrary state at time 0. For any $k \in \mathbb{N}$, define $t_k$ as

$$(k + 3)\bigl(\vartheta(R_2 + 3d) + 8(1 - \lambda)R_2 + d\bigr) + R_1 + T_2 + T_4 + 5d.$$

Then the set of all nodes that remain non-faulty until time $t_k$ reaches a quasi-stabilization point during $[E_3, t_k]$ with probability at least

$$1 - (f + 1)e^{-k(n-f)/2}.$$

Moreover, at any time $t \ge E_3$, the set of nodes that are non-faulty at time $t$ is coherent.

Proof: The last statement of the corollary holds by definition.

We need to show that Theorem 4.10 holds for the modified time interval $[E_3, (k + 3)E_3]$ with the modified probability of at least $1 - e^{-k(n-f)/2}$. If this is the case, we can proceed as in the earlier corollaries (cf. Corollary 5.3).

We start to track the execution from time $E_3$. Whenever a non-faulty node switches to state init at a good time, the adversary must corrupt it in order to prevent subsequent deterministic stabilization. In the proof of Theorem 4.10 we showed that for any non-faulty node, there are at least $k + 1$ different times during $[E_3, (k + 3)E_3]$ when it switches to init, each of which has a probability of being good that is independently lower bounded by $1/2$. Since Lemma 4.9 holds for any execution where we have at most $f$ faults, the adversary corrupting some node at time $t$ affects the current and future trials of that node only, while the statement still holds true for the non-corrupted nodes. Thus, the probability that the adversary may prevent the system from stabilizing until time $t_k$ is upper bounded by the probability that $(k + 1)(n - f)$ independent and unbiased coin flips show tails $f$ or fewer times. Chernoff's bound states for the random variable $X$ counting the number of tails in this random experiment that for any $\delta \in (0, 1)$,

$$P[X < (1 - \delta)E[X]] < \left(\frac{e^{-\delta}}{(1 - \delta)^{1 - \delta}}\right)^{E[X]} < e^{-\delta E[X]}.$$

Inserting $\delta = k/(k + 1)$ and $E[X] = (k + 1)(n - f)/2$, we see that

$$P[X \le f] \le P[X < (n - f)/2] < e^{-k(n-f)/2},$$

as claimed. ■
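The coin-flip experiment in the proof can be evaluated exactly for small parameters. The following is a minimal sketch, not part of the paper: the function names `tail_at_most` and `stated_bound` are hypothetical, and the instance $n = 4$, $f = 1$, $k = 2$ is chosen for illustration only. It compares the exact tail probability with the value $e^{-k(n-f)/2}$ stated above for this single instance and does not verify the inequality in general.

```python
from math import comb, exp


def tail_at_most(flips: int, f: int) -> float:
    """Exact P[X <= f] for X ~ Binomial(flips, 1/2), i.e. at most f tails."""
    return sum(comb(flips, j) for j in range(f + 1)) / 2 ** flips


def stated_bound(n: int, f: int, k: int) -> float:
    """The value e^{-k(n-f)/2} that the proof states as an upper bound."""
    return exp(-k * (n - f) / 2)


# Illustrative instance (assumed values, not taken from the paper).
n, f, k = 4, 1, 2
flips = (k + 1) * (n - f)        # number of coin flips considered in the proof
exact = tail_at_most(flips, f)   # exact probability of at most f tails
bound = stated_bound(n, f, k)    # value stated in the proof for this instance
print(flips, exact, bound)
```

For this instance the exact tail is $10/512$, which lies below the stated value $e^{-3}$.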

C. Constant-Time Stabilization 

Up to now, we considered worst-case scenarios only. In practice, it is likely that faulty nodes do not show entirely arbitrary behavior. In particular, they might still be partially following the protocol, not exhibit a level of coordination that could only be achieved by a powerful central instance, or not be fully aware of non-faulty nodes' states. Moreover, it is unlikely that at the time when a majority of the nodes becomes non-faulty, all their timeouts $R_2$ and $R_3$ have been reset recently. In such settings, stabilization will be much easier and therefore be achieved in constant time with a large probability. It is difficult, however, to name simple conditions that cover most reasonable cases. Generally speaking, once (randomized) timeouts of duration $R_2$ or $R_3$ are not "messed up" at non-faulty nodes anymore, faulty channels and nodes need to collaborate in an organized manner in order to prevent stabilization for a long time period. We give a few examples in the following corollary.

Corollary 5.6: Suppose $W \subseteq V$, where $|W| \ge n - f$, satisfies that for each $i \in W$, all (randomized) timeouts of duration $R_2$ or $R_3$ are correct during $[t^-, t^+]$, and the node is non-faulty during $[t^- + \vartheta(3(R_2 + 3d) + 2(8(1 - \lambda)R_2 + d)), t^+]$. Moreover, channels between nodes in $W$ are correct during $[t^- + \vartheta(3(R_2 + 3d) + 2(8(1 - \lambda)R_2 + d)), t^+]$ and did not insert init signals that have not been sent during $[t^-, t^- + \vartheta(3(R_2 + 3d) + 2(8(1 - \lambda)R_2 + d))]$ or delay them
by more than Ri time. Define :— + i?(3(i?2 + 3d) + 



2(8(1 — A)i?2 + d)) + -Ri + d. Moreover, assume that one 
of the following statements holds during 

(i) Nodes in $V \setminus W$ switch to init at times that are independently distributed with probability density at most $O(1/(R_1 n))$, and channels from $V \setminus W$ to $W$ do not generate init signals on their own (or delay init
signals from before i~ more than i?i time). 

(ii) Channels from $V \setminus W$ to $W$ switch to init at times that are independently distributed with probability density at most $O(1/(R_1 n^2))$.

(iii) Channels from $V \setminus W$ to $W$ switch to init obliviously of the history of signals originating at nodes in $W$ and
do not know the time t^. 

If <+ e i^+n{kRi), k gN, then there is a PF-stabilization 
point during [t^^t^ +0{kRi)] with probability at least 1 — 

Proof: In Theorems 4.4 and 4.15 we showed that stabilization is deterministic once a good resynchronization point occurs. The notion of coherency essentially states that at non-faulty nodes, each timeout expired at least once and has not been reset again because of incorrect observations on other non-faulty nodes' states until the set is considered coherent (cf. Lemma 4.3). Subsequently, the respective nodes are non-faulty and the channels connecting them correct. This is true by the prerequisites of the corollary, which essentially state respective conditions on timeouts $R_2$ and $R_3$ explicitly, while rephrasing the conditions for coherency for the remaining timeouts (note that $R_1$ is the largest timeout except for $R_2$ and $R_3$).

Moreover, the time span during which $R_2$ and $R_3$ behave and are observed regularly is large enough for $R_3$ to expire twice and additional $R_2 + 3d$ time to pass. This accounts for the fact that in the proof of Theorem 4.10 we essentially first wait until $R_3$ expires once (so the adversary has no useful information on the timeout at the respective node anymore) and then consider the subsequent time(s) when it expire(s). The proof then exploits that non-faulty nodes' timeouts $R_3$ will expire at roughly independently uniformly distributed points in time. Therefore, unless faulty nodes or channels interfere, the statement of the corollary holds.

Hence, we need to show that for any of the three conditions, there is not too much meddling from outside $W$. For Conditions (i) or (ii), we see that the probability that there are no init signals on channels from $V \setminus W$ to $W$ at all for any time span of length $O(R_1)$ is at least constant, regardless of the time interval considered. Regarding Condition (iii), recall that Theorem 4.10 essentially shows that whatever the strategy of the adversary, the expected number of good $W$-resynchronization points during a time interval (where $W$ is coherent) is linear in the size of the interval divided by $R_1$ if the interval is sufficiently large. Since the adversary is oblivious of the current time in relation to $t^+$ and the state of $W$, the statement that for any strategy of the adversary the amortized number of good stabilization points per $R_1$ time is constant yields the claim of the corollary. ■
We remark that this observation is particularly interesting as the core routine of the algorithm is independent of the resynchronization routine after stabilization. If at some time $W$ becomes subject to a large number of faults resulting in loss of synchronization, but the resynchronization routine still works properly, it is very likely that $W$ will recover within $O(1)$ time (provided $R_1 \in O(1)$). On the other hand, if the resynchronization routine fails in the sense that a majority of the nodes suffers from faulty timeouts $R_2$ or $R_3$, or communication is faulty between too many nodes, this will not affect the core routine unless too many components related to it fail as well.

VI. The FATAL+ Protocol 

The synchronized pulses established by the FATAL pulse synchronization algorithm could in principle serve as the local clock signals provided to the application layer of the SoC.* However, just using the FATAL protocol in this way would result in a very low clock frequency: Despite the fact that the time between pulses is $\Theta(d)$ (if the timeouts are chosen accordingly) and thus asymptotically optimal, the actual clock speed would be several orders of magnitude below the upper bound resulting from [42], due to the large implied constants. Moreover, the system model introduced in Section II assumes that delays may vary arbitrarily between $0$ and $d$, with $d$ also covering the fairly complex implementation of communicating the main algorithm's states (see Section VII-B). By contrast, pure wire delays of the communication channels between different nodes are much smaller and also vary within a smaller range.

This section contains an extension of FATAL, termed FATAL+, which overcomes these limitations. In a nutshell, it consists of adding a fast non-self-stabilizing, Byzantine-tolerant algorithm termed quick cycle to FATAL, which generates exactly $M > 1$ fast clock ticks between any two pulses at a correct node after stabilization.

The Quick Cycle Algorithm 

Consider a system of $n$ nodes, each of which runs the FATAL pulse synchronization protocol. Additionally, each node is equipped with an instance of the quick cycle state machine depicted in Figure 5. The interface between the quick cycle algorithm and the underlying FATAL pulse synchronization protocol is by means of two signals only, one for each direction of the communication: (i) The quick cycle state machine generates the NEXT signal by which it (weakly) influences the time between two successive pulses generated by FATAL, and (ii) it observes the state of the $(T_2^+, \mathit{accept})$ signal, which signals the expiration of an additional timer added to the FATAL protocol. The timer is coupled to the state accept of FATAL, in which the pulse synchronization algorithm generates a new pulse. The signal's purpose is to enforce a consistent reset of the quick cycle state machine once FATAL has stabilized.

Figure 5. The quick cycle of the FATAL+ protocol.

* In order to establish a consistent global tick numbering (needed for establishing a global notion of time across different clock domains) of arbitrarily large bounded clocks, a self-stabilizing digital clock synchronization algorithm like the one from [25] can be employed. Implementing such algorithms in SoCs is part of our future work and thus outside the scope of this paper, however.

Essentially, the quick cycle state machine is a copy of the outer cycle of Figure 2 that is stripped down to the minimum. However, an additional mechanism is introduced in order to ensure stabilization, namely, some coupling to the accept state of the main algorithm: Whenever a pulse is generated by FATAL, we require that all nodes switch to the accept$^+$ state unless they already occupy that state. This is easily achieved by incorporating the state of the expiration signal of the additional FATAL timer $(T_2^+, \mathit{accept})$ in the guards of Figure 5. Since pulses are synchronized up to the skew $\Sigma$ of the pulse synchronization routine, it follows that all nodes switch to accept$^+$ within a time window of $\Sigma + 2d$. Subsequently, all nodes will switch to state ready$^+$ before the first one switches to propose$^+$ provided that $T_3^+$ is sufficiently large, and the condition that $f + 1$ propose$^+$ signals trigger switching to propose$^+$ guarantees that all nodes switch to accept$^+$ in a tightly synchronized fashion.
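The transition structure described above can be illustrated with a deliberately simplified sketch. It is not the state machine of Figure 5: the guards are reduced to the conditions named in the text (forced switch to accept$^+$ on a FATAL pulse, $f + 1$ observed propose$^+$ signals, $n - f$ observed propose$^+$ signals), timing, memory flags, and metastability are ignored, and all identifiers (`QState`, `quick_cycle_step`, the boolean timeout arguments) are hypothetical names introduced for illustration.

```python
from enum import Enum


class QState(Enum):
    ACCEPT = "accept+"
    READY = "ready+"
    PROPOSE = "propose+"


def quick_cycle_step(state, *, n, f, proposers_seen,
                     accept_timeout_expired, propose_timeout_expired,
                     fatal_pulse):
    """Return the next quick-cycle state under simplified, assumed guards."""
    if fatal_pulse:
        # Coupling to FATAL: a pulse forces every node into accept+.
        return QState.ACCEPT
    if state is QState.ACCEPT and accept_timeout_expired:
        return QState.READY
    if state is QState.READY and (propose_timeout_expired
                                  or proposers_seen >= f + 1):
        # f + 1 observed propose+ signals include at least one non-faulty node.
        return QState.PROPOSE
    if state is QState.PROPOSE and proposers_seen >= n - f:
        return QState.ACCEPT
    return state


# Example with n = 4 nodes and f = 1 tolerated fault (illustrative values).
idle = dict(n=4, f=1, accept_timeout_expired=False,
            propose_timeout_expired=False, fatal_pulse=False)
after_ready = quick_cycle_step(QState.READY, proposers_seen=2, **idle)
after_propose = quick_cycle_step(QState.PROPOSE, proposers_seen=3, **idle)
```

The threshold $f + 1$ ensures that a single node's timeout can pull all others along, which is what keeps the subsequent switch to accept$^+$ tightly synchronized.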

One element that is not depicted explicitly in Figure 5 is that nodes increase an integer cycle counter by one whenever they switch to accept$^+$. The counter is reset to zero whenever $(T_2^+, \mathit{accept})$ expires, i.e., shortly after a pulse generated by the underlying pulse synchronization algorithm. The algorithm makes sure that, once the compound algorithm has stabilized, these resets never happen when the counter holds a non-zero value. The counter operates mod $M \in \mathbb{N}$, where $M$ is large enough so that at least roughly $T_2 + T_3$ and at most $(T_2 + T_4)/\vartheta$ time passed since the most recent pulse when it reaches $M \equiv 0$ again. Whenever the counter is set to 0, node $i \in V$ will set its NEXT$_i$ signal to 1 and switch it back to 0 at once (thus raising the respective NEXT$_i$ memory flag of the main algorithm). Thus, by actively triggering the next pulse, we ensure that a pulse does not occur at an inconvenient point in time: When the system has stabilized, exactly $M$ switches to accept$^+$ of the quick cycle algorithm occur between any two consecutive pulses at a correct node. As these switches also occur synchronously at different nodes, it is apparent that the quick cycle state machine in fact implements a bounded-size synchronized clock.
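The counter behavior can be sketched as follows. This is an illustrative model only: the class `CycleCounter` and the helper `next_pulses` are hypothetical names, NEXT is modeled as a boolean returned when the counter wraps around to 0, and the reset triggered by the additional timer is modeled as a plain assignment (which, after stabilization, only ever happens when the counter is already 0).

```python
class CycleCounter:
    """Cycle counter mod M that raises NEXT when it wraps around to 0."""

    def __init__(self, modulus: int):
        if modulus < 1:
            raise ValueError("modulus M must be a positive integer")
        self.modulus = modulus
        self.value = 0

    def on_accept_plus(self) -> bool:
        """Count one switch to accept+; return True iff NEXT is pulsed."""
        self.value = (self.value + 1) % self.modulus
        return self.value == 0

    def on_accept_timer_expired(self) -> None:
        """Reset issued shortly after a FATAL pulse."""
        self.value = 0


def next_pulses(modulus: int, switches: int) -> list:
    """NEXT outcome for each of `switches` consecutive accept+ switches."""
    counter = CycleCounter(modulus)
    return [counter.on_accept_plus() for _ in range(switches)]


pattern = next_pulses(4, 8)  # NEXT is raised on every 4th switch
```

With $M = 4$, NEXT is raised on the fourth and eighth switch, so exactly $M$ quick-cycle ticks fall between two triggered pulses.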

To derive accurate bounds on the skew of the protocol, 
we need to state the involved delays more carefully. 

Definition 6.1 (Refined Delay Bounds): The state of the quick cycle algorithm is communicated via separate channels $S_{i,j}^+$, with $i, j \in V$, whose delays must vary within $d_{\min}^+$ and $d_{\max}^+$ for the channels to be considered correct during $[t^-, t^+]$. State transitions of the quick cycle state machine, resets of its timeouts, and clearance of its memory flags take at most $d_{\max}^+$ time.

Setting $\Sigma^+ := 2d_{\max}^+ - d_{\min}^+$, we assume that the following timing constraints hold:



3 
M 



> 
> 
> 



T++T + 



' T++T+ + S + +3d,t 



(30) 
(31) 
(32) 
(33) 



It follows from Lemma 3.4 that it is always possible to pick appropriate values for the timeouts and $M$. Note, however, that choosing $M \in \omega(1)$ requires that $T_2 + T_3 \in \omega(1)$, resulting in a superlinear stabilization time. More precisely, the stabilization time of FATAL+ is, given $M$ and minimizing the timeouts under this constraint, in $\Theta(Mn)$. As mentioned previously, this limitation can be overcome by employing a digital clock synchronization algorithm such as [22].

We now prove the correctness of the FATAL+ protocol.

Theorem 6.2: Let $W \subseteq V$, where $|W| \ge n - f$, and define $T(k)$, for $k \in \mathbb{N}$, as in Corollary 4.16. Then, for any $k \in \mathbb{N}$, the FATAL+ protocol is a $(W, W^2)$-stabilizing pulse synchronization protocol (where accept$^+$ is the "pulse" state) with skew $\Sigma^+$ and accuracy bounds $(T_1^+ + T_3^+)/\vartheta - \Sigma^+$ and $T_1^+ + T_3^+ + \Sigma^+ + 3d_{\max}^+$. It stabilizes within time $T(k) + T_1^+ + T_3^+ + \Sigma^+ + 3d + 3d_{\max}^+$ with probability at least $1 - 2^{-k(n-f)}$. Moreover, the cycle counters increase by exactly one mod $M$ at each pulse, within a time window of $\Sigma^+$, and both the quick cycle state machine and the cycle counters are metastability-free once the protocol has stabilized and remains fault-free in $W$.
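To make the shape of these bounds concrete, the following sketch evaluates them for given parameters. It is an illustration under stated assumptions, not part of the analysis: the names `QuickCycleParams`, `skew`, `accuracy_bounds`, and `stabilization_overhead` are hypothetical, the two quick-cycle timeouts whose sum enters the bounds are passed generically as `t_first` and `t_second`, and the numeric values are arbitrary and have not been checked against the timing constraints (30) to (33).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuickCycleParams:
    t_first: float      # first quick-cycle timeout entering the bounds (assumed)
    t_second: float     # second quick-cycle timeout entering the bounds (assumed)
    d: float            # delay bound d of the main algorithm
    d_min_plus: float   # minimal quick-cycle channel delay
    d_max_plus: float   # maximal quick-cycle channel delay
    theta: float        # clock drift bound, at least 1


def skew(p: QuickCycleParams) -> float:
    """Sigma^+ := 2 * d_max^+ - d_min^+."""
    return 2 * p.d_max_plus - p.d_min_plus


def accuracy_bounds(p: QuickCycleParams) -> tuple:
    """Lower and upper bound on the time between consecutive quick-cycle pulses."""
    total = p.t_first + p.t_second
    return total / p.theta - skew(p), total + skew(p) + 3 * p.d_max_plus


def stabilization_overhead(p: QuickCycleParams) -> float:
    """Time added to T(k) in the stabilization bound of the theorem."""
    return p.t_first + p.t_second + skew(p) + 3 * p.d + 3 * p.d_max_plus


# Arbitrary illustrative values (not validated against the timing constraints).
params = QuickCycleParams(t_first=8.0, t_second=16.0, d=4.0,
                          d_min_plus=1.0, d_max_plus=2.0, theta=1.5)
lower, upper = accuracy_bounds(params)
```

The gap between `lower` and `upper` shows how the drift bound and the delay uncertainty of the quick-cycle channels together limit the achievable frequency stability.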

Proof: Assume that nodes in $W \subseteq V$, where $|W| \ge n - f$, are non-faulty and channels between them are correct during $[t^-, t^+]$, where $t^+ \ge t^- + T(k) + T_1^+ + T_3^+ + \Sigma^+ + 3d + 3d_{\max}^+$. According to Corollary 4.16, with probability at least $1 - 2^{-k(n-f)}$, there exists a time $t_0 \in [t^-, t^- + T(k)]$ such that all nodes in $W$ switch to accept within $[t_0, t_0 + 2d)$, and they will continue to switch to accept regularly in a synchronized fashion until at least $t^+$. For the remainder of the proof, we assume that such a time $t_0$ is given; from here we reason deterministically.



The skew bound is shown by induction on the $k$-th consecutive quick cycle pulse, where $k \in \mathbb{N}$, generated after the stabilization time $t_0$ of the FATAL algorithm. Note that the time for which we are going to establish that the compound algorithm stabilizes is $t_1 > t_0$; here we denote for $k \ge 1$ by $t_k$ the time when the first node from $W$ switches to accept$^+$ for the $k$-th time after $t_0 + 3d$, i.e., the beginning of the $k$-th pulse of FATAL+ that we prove correct. W.l.o.g. we assume that $t^+ = \infty$; otherwise, all statements will be satisfied until $t^+$ only (which is sufficient).

To prove the theorem, we are going to show by induction on $k \in \mathbb{N}$ that

(i) $t_1 \in [t_0 + (T_1^+ + T_3^+)/\vartheta,\ t_0 + T_1^+ + T_3^+ + \Sigma^+ + 3d + 3d_{\max}^+]$,

(ii) if $k \ge 2$, $t_k \le t_{k-1} + T_1^+ + T_3^+ + \Sigma^+ + 3d_{\max}^+$,

(iii) if $k \ge 2$, $t_k \ge t_{k-1} + (T_1^+ + T_3^+)/\vartheta$,

(iv) $\forall i \in W$: $i$ switches to accept$^+$ for the $k$-th time after $t_0 + 3d$ during $[t_k, t_k + \Sigma^+)$,

(v) if $k \ge M$, for $l := \lfloor k/M \rfloor$, $\forall i \in W$: $i$ switches to accept for the $l$-th time after $t_0 + 3d$ during $[t_{Ml}, t_{Ml} + \Sigma^+ + 2d)$,

(vi) $\forall i \in W$: node $i$'s cycle counter switches from $k - 1 \bmod M$ to $k \bmod M$ at some time from $[t_k, t_k + \Sigma^+)$, and

(vii) if $k \ge 2$, $\forall i \in W$: node $i$'s cycle counter changes its state exactly once during $[t_{k-1}, t_k)$.

In particular, the protocol is a pulse synchronization protocol with the claimed bounds on skew, accuracy, and stabilization time. Proving these properties will also reveal that the quick cycle is metastability-free after time $t_1$.

To anchor the induction at $k = 1$, we need to establish Statement (i) as well as Statements (iv) and (vi) for $k = 1$; the remaining statements are empty for $k = 1$.

Recall that any node i G W switches to accept during 
[to, to + 2d). Hence, during 



to + 3d, to + 



T+ 



[to + 3d, to + 3d + 3d+a^^), 



at no node in W, {T^ , accept) is expired, implying that all 
nodes in W are in state accept^ during [to+3d+2d+jj^, to + 
3d+3d+jj^). Note that each node will reset its cycle counter 
to when {T2 , accept) expires, i.e., after having completed 
its transition to accept~^. 

The above bound shows that at the minimal time after $t_0 + 3d$ when a node in $W$ switches to ready$^+$, it is guaranteed that no node is observed in propose$^+$ until the minimal time $t_p > t_0 + 3d$ when a node in $W$ switches to propose$^+$. Moreover, at any node switching to state ready$^+$, timeout $(T_2^+, \mathit{accept})$ must be expired, implying that the node may not switch from ready$^+$ to propose$^+$ due to this signal until it switches to accept again. Recall that nodes set their NEXT signals to 1 only briefly when their cycle counters are set to 0. Hence, for each such node in $W$, this signal is observed in state 0 from the time when $(T_2^+, \mathit{accept})$ expires until (a) at least time $t_M$ or (b) the time the node is forced by a switch to accept to set its counter to 0, whichever is earlier. Examining the main state machine, it thus can be easily verified that no node in $W$ may switch from ready$^+$ to propose$^+$ because $(T_2^+, \mathit{accept}) = 0$ before (a) time $t_M$ or (b) time
> to 



to 



1} 



M{T+ + T3* 



3d^ 



3d (34) 



is reached. We obtain: 

(P1) No node in $W$ observes $(T_2^+, \mathit{accept})(t) = 0$ at some time $t \in [t_0 + 3d, \min\{t_M, t_0 + M(T_1^+ + T_3^+ + 3d_{\max}^+) + 3d\}]$ when it is not in state accept$^+$.

Considering that any node i G W will switch to ready~^ 
once both and T,^ expired and subsequently to propose^ 
at the latest when expires (provided that it does not 
switch back to accepf^ first), it follows that by time 



to + 3d 



-n 



+ 3<ax + 3d (35) 

(36) 



each node in W must have been observed in propose^ at 
least once. On the other hand, as we established that nodes 
do not observe nodes in W in state propose^ when switching 
to ready^ at or after time t + 3d before the first node in W 
switches to propose'^, it follows that until time 



tn 



m 

Tto 



3d + T+ + 2d^ 



(37) 



nodes in W will have at most | V\Vl^| < / of their pmpose~^ 
flags in state 1, and their timeout did not expire yet. 
Thus, by (PI), the first node in W that switches to propose^ 
after to + 3d must do so at time tp > to + 3d + + 2d^^^^. 

Recall that ti is the minimal time larger than to + 3d when 
a node in W switches to state accept~^. By ( (35] l and since 
[W[ > n — f, we have that each node in W observes at least 
n~f nodes in propose^ by time to+r]^+T3'" + 3d+3d+3^^, 
and thus 



tl<to 



T++T+ 



3d + 3di:„ 



Moreover, we can trivially bound 



ti >tp > to 



{T+ + T+)ld 



(38) 



(39) 



From ( |38| l and ^i9) it follows that ti satisfies Statement (i) 
of the claim. 

Since at time $t_1$ there is a node $i \in W$ switching from propose$^+$ to accept$^+$, (P1) implies that it must memorize at least $n - 2f \ge f + 1$ nodes in $W$ in state propose$^+$, which must have switched to this state during $[t_p, t_1 - d_{\min}^+]$. By the above considerations regarding the reset of the propose$^+$ flags, this yields that all nodes in $W$ will memorize at least $f + 1$ nodes in state propose$^+$ by time $t_1 + d_{\max}^+ - d_{\min}^+$ and thus switch to propose$^+$ (if they have not done so yet). It follows that by time $t_1 + 2d_{\max}^+ - d_{\min}^+ = t_1 + \Sigma^+$, all nodes in $W$ memorize at least $|W| \ge n - f$ nodes in propose$^+$ and therefore switched to accept$^+$. Hence, we successfully established Statement (iv) of the claim for $k = 1$. Statement (vi) follows for $k = 1$, as the cycle counters have been reset to zero at the expiration of $(T_2^+, \mathit{accept})$ and are increased upon the subsequent state transition to accept$^+$. Note that Statements (ii), (iii), (v), and (vii) trivially hold.

We now perform the induction step from $k \in \mathbb{N}$ to $k + 1$. Assume that Statements (ii) to (vii) hold for all values smaller than or equal to $k$; Statement (i) only applies to $k = 1$ and was already shown. Define $l := \lfloor k/M \rfloor \ge 0$. Thus, if we can show Statement (ii) for $k + 1$, we may infer that



tk+l 

^^^t^Mi + (fc + 1 - Ml) {T+ +T+ + J: , 



< tMi + M{T+ + T+ + + 



3d 



tMl 



T2 + Ti 



) + 3d 
(40) 

(41) 



In case $l = 0$, it holds that $k < M$ and we may deduce (P1) by the same arguments as in the induction basis.

In case / > 1, we use Statement (v) for value fc, and, 
by analogous arguments as in the induction basis, deduce 
that at no node in W, {T2 , accept) is expired during 
[tMl + ^d,tMi + 3d + 2d+jj^), implying that all nodes in 
W are in accept^ during that time. Repeating the reasoning 
of the induction basis before (PI) with to replaced by tnji, 
ti replaced by tf^, and tjv/ replaced by Imi+m shows that: 

(P1') No node in $W$ observes $(T_2^+, \mathit{accept})(t) = 0$ at some time $t \in [t_{Ml} + 3d, \min\{t_{Ml+M}, t_{Ml} + M(T_1^+ + T_3^+ + 3d_{\max}^+) + 3d\}]$ when it is not in state accept$^+$.

Since further $t_{Ml+M} \ge t_{k+1}$ by definition of $l$, we obtain from (P1') that no node $i \in W$ will memorize NEXT$_i = 1$ earlier than time $\min\{t_{k+1}, t_{Ml} + M(T_1^+ + T_3^+ + 3d_{\max}^+) + 3d\}$ (again by reasoning analogously to the induction basis).

By Statement (iv) for the value $k$, we know that each node $i \in W$ switches to accept$^+$ during $[t_k, t_k + \Sigma^+)$. In particular, $i$ will increase its cycle counter at the respective time, i.e., Statement (vi) for $k + 1$ follows at once if we establish Statement (vii) for $k + 1$. As Statement (iv) for the value $k$ together with Statement (ii) for value $k + 1$ imply that each node switches to accept$^+$ exactly once during $[t_k, t_{k+1})$, Statement (vii) for $k + 1$ follows, provided that we can exclude that the counter is reset to 0, due to $(T_2^+, \mathit{accept})$ expiring, at a time when it holds a non-zero value.

We now show that this never happens. By Statement (v) for value $k$, each node $i \in W$ switches to accept during

$$[t_{Ml}, t_{Ml} + \Sigma^+ + 2d) \qquad (42)$$

and this time is unique during $[t_{Ml}, t_{k+1})$ due to (40).



Because of (42), a node in $W$ will reset its timeout $(T_2^+, \mathit{accept})$ during

$$[t_{Ml}, t_{Ml} + \Sigma^+ + 3d), \qquad (43)$$

and $(T_2^+, \mathit{accept})$ will expire within



T+ 

tMl + -jf.tMi + T+ + S+ + 3d + dl 
T+ T+ 

tMl H , tMl H TT 

V V 



[tMl + 3d + 2d+3^^, tMi+i) 



Thus, no node in $W$ leaves state accept$^+$ after switching there for the $(Ml)$-th time after $t_0 + 3d$ before observing that $(T_2^+, \mathit{accept})$ is reset and expires again. In particular, this shows that the counters are only reset to 0 at times when they are 0 anyway. Granted that Statement (ii) holds for $k + 1$, Statement (vii) for $k + 1$ follows.

Next, we establish Statements (ii) to (iv) for fc + 1. We 
reason analogously to the case of fc = 1, except that we 
have to revisit the conditions under which state accept^ is 
left. As we have just seen, nodes switch from accept^ to 
ready^ upon expiring. Thus, as all nodes in W switch 
to accept^ during [t^, tk+T.^), they switch to ready^ within 
the time window [tk + r+/i?, tk + T+ + E+ + d+^^). By 
time 



tk 



tk 



+d+ 

' ^max ' 



all nodes in W will be observed in accept^ (and therefore 
not in propose^), together with (PI') preventing that the first 
node in W that (directly) switches from ready^ to propose'^ 
afterwards does so without expiring first. 

More precisely, according to (PI') no node in W observes 
{T2 ; accept) to be zero until time minjifc^.! , tMi+M {T^ + 
T3++3d+3^^) + 3d}. As showing Statement (ii) for fc + 1 will 
imply [inequality (|40|i[ we can w.l.o.g. disregard the case that 
tMl + M [T^ + + 3d+a^) + 3d < tk+i in the following. 
Thus, each node i E W will be observed in state propose^ 
no later than time tk + + T3+ + 11+ + 3d+jj^. As argued 
for fc 1, it follows that indeed tk+i < tk + + + 
+ 3d+jjx' i-S-' Statement (ii) for fc + 1 holds. Further 
each node i switches to accept^ for the (fc + 1)*^ time after 
to + 3d during time [tk+i, tk+i + i.e.. Statements (iv) 
for fc + 1 holds. Statement (iii) for fc + 1 is deduced from the 
fact that it takes at least {T^ + T^)l-& time until the first 
node from W switching to propose^ after tk + S+ does so, 
since timeouts and need to be reset and expire first, 
one after the other 

Finally, we need to establish Statement (v) for fc + 1. If 
M does not divide fc + 1. Statement (v) for fc + 1 follows 
from Statement (v) for fc. Otherwise M does divide fc + 1 



and we can bound (note that we already build on Statement (iii) for $k + 1$ here)

$$\ldots > t_{Ml} + \ldots$$

As by Statement (v) for $k$ all nodes in $W$ switched to accept during $[t_{Ml}, t_{Ml} + \Sigma^+ + 2d)$, we conclude from the main state machines' description that all nodes in $W$ are observed in state ready with timeout $T_3$ being expired (or already switched to propose or even accept) by time $t_{Ml} + T_2 + T_3 + \Sigma^+ + 4d$. (This statement relies on the constraints on the main state machines' timeouts, which require that $T_2$ expiring is the critical condition for switching to ready.) Because all their NEXT signals switch to one during $[t_{k+1}, t_{k+1} + \Sigma^+)$, all nodes in $W$ must therefore have switched to propose by time $t_{k+1}$ (for details, we refer to the analysis of the FATAL algorithm in the full paper). Consequently, as we stated that w.l.o.g. they do not switch to accept again before time $t_{k+1}$, they do so at times in $[t_{k+1}, t_{k+1} + \Sigma^+ + 2d)$ as claimed.

This completes the induction. According to Statement (i), $t_1$ satisfies the claimed bound on the stabilization time. With respect to this time, Statement (iv) provides the skew bound, and combining it with Statements (ii) and (iii), respectively, yields the stated accuracy bounds. Statements (vi) and (vii) show the properties of the counters. Metastability-freedom of the state machine is trivially guaranteed by the fact that each state has a unique successor state. For the counter, we can infer metastability-freedom after stabilization from the observation made in the proof that for times $t \ge t_1$, $(T_2^+, \mathit{accept})(t) = 0$ at a non-faulty node implies that it is in state accept$^+$ with its cycle counter equal to zero. This completes the proof. ■

For some applications, one might require an even higher operational frequency than provided by the quick cycle state machine. It turns out that there is a simple solution to this.

Increasing the Frequency Further

Given any pulse synchronization protocol, one can derive clocks operating at an arbitrarily large frequency as follows. Whenever a pulse is triggered locally, the nodes start to increase a local integer counter modulo some value $m \in \mathbb{N}$ at a speed of $\phi$ times that of a local clock, starting from 0. Denote by $T^-$ the accuracy lower bound of the protocol and suppose that the local clock controlling the counter runs at a speed between 1 and $\rho \in (1, \vartheta]$, i.e., its maximum drift is $\rho - 1$ (we introduce $\rho$ since one might want to invest into a single, more accurate clock source per node in order to obtain smaller skews). Once the counter reaches the value $m - 1$, it is halted until the next pulse. We demand that

$$m \le \phi T^-. \qquad (44)$$

This approach is similar to the one presented in [23], enriched by addressing the problem of metastability.
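The halting counter can be modeled in a few lines. This is a sketch under simplifying assumptions, not the hardware implementation: the function names `max_fast_modulus` and `fast_counter_value` are hypothetical, the local clock is assumed to run at a constant rate `clock_rate` between 1 and $\rho$, and the counter is assumed to tick once per $1/(\phi \cdot \text{clock\_rate})$ time.

```python
import math


def max_fast_modulus(phi: float, t_minus: float) -> int:
    """Largest integer m satisfying m <= phi * T^- (Inequality (44))."""
    return math.floor(phi * t_minus)


def fast_counter_value(elapsed: float, phi: float,
                       clock_rate: float, m: int) -> int:
    """Counter value `elapsed` time after the local pulse.

    The counter starts at 0, advances at phi times the local clock rate,
    and is halted at m - 1 until the next pulse restarts it.
    """
    ticks = math.floor(elapsed * phi * clock_rate)
    return min(ticks, m - 1)


# Illustrative values (assumed): phi = 8, accuracy lower bound T^- = 2.5.
m = max_fast_modulus(8.0, 2.5)
early = fast_counter_value(1.0, 8.0, 1.0, m)   # still counting
late = fast_counter_value(10.0, 8.0, 1.0, m)   # halted at m - 1
```

Halting at $m - 1$ is what keeps the counter from wrapping on its own; it only returns to 0 when the next pulse arrives.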

In the context of the FATAL+ protocol, we get the 
following result. 

Corollary 6.3: Adding a counter as described above to the FATAL+ protocol and concatenating the counter values of the two counters at node $i \in V$ yields a bounded logical clock $L_i \in \{0, \ldots, mM - 1\}$. At any time $t$ when the protocol has stabilized on some set $W$ (according to Theorem 6.2), it holds for any two nodes $i, j \in W$ that

1 



\L,{t) - Lj{t) modTOMi < 



1 - - 



Once stabilized, these clocks do not "jump", i.e., they always
increase by exactly one mod $mM$, with at least $1/\rho$ time
between any two consecutive "ticks".

The amortized clock frequency is within the bounds 
m/(r+ + T+ + S+ + 3d+,J and dm/[T+ +T+-^+). 
Viewed as a state machine in our model, the clocks $L_i$,
where $i \in W$, are metastability-free after stabilization.

Proof: Observe that it takes at least $m/(\phi\rho)$ time for
one of the new counters to increase from 0 to $m$. Since the
counters are restarted at pulses, which are triggered locally
at most S+ time apart, at the time when a "fast" node arrives
at the value $m$, a "slow" node will have increased its clock
by at least [$m/\rho - \ldots$ . According to Inequality (44), slow
nodes will be able to increase their counters to $m$ before the
next pulse. The claimed bound on the clock skew and the 
facts that clock increases are one by one and at most every 
1/p time follow. 

The bound on the amortized clock frequency follows by 
considering the minimal and maximal times M iterations of 
the quick cycle may require. 

The metastability-freedom of the clock is deduced from 
the metastability-freedom of the individual counters. For the 



new counter this is guaranteed by Inequality (44), since the
counter is always halted before it is reset due to a new
quick cycle pulse. ■
We remark that in an implementation, one would probably 
utilize the better clock source, if available, to drive and 



as well. Maximizing $m$ with respect to Inequality (44)



and choosing + sufficiently large will thus result in 
clocks whose amortized drift is arbitrarily close to $\rho$, the
drift of the underlying local clock source. 

VII. Implementation 

In this section, we provide an overview of our FPGA 
prototype implementation of the FATAL+ protocol. The 

Since the new counter is started together with , this does not incur
metastability. Special handling is required for on the $M$-th pulse of the
quick cycle, though.



purposes of this implementation are (i) to serve as a proof 
of concept, (ii) to validate the predictions of the theoretical 
analysis, and (iii) to form a basis for the future development 
of protocol variants and engineering improvements. Rather 
than striving for optimizing performance, area or power 
efficiency, our primary goal is hence to essentially provide 
a direct mapping of the algorithmic description to hardware, 
and to evaluate its properties in various operating scenarios. 

Our implementation does not follow the usual design 
practice, for several reasons: 

Asynchrony: Targeting ultra-reliable clock generation in 
SoCs, the implementation of FATAL+ itself cannot rely on 
the availability of a synchronous clock. Moreover, some 
performance-critical guards, like the one of the transition 
from propose to accept in Figure 2, are purely asynchronous
and should hence not be synchronized to a local clock.
Even worse, testing for activated guards synchronized to a
local clock source bears the risk of generating metastability,
as remote signals originate in different clock domains. On
the other hand, conventional asynchronous state machines
(ASM) are not well-suited for implementing Figures 2 to 5
due to the possibility of choice of successor states
and continuously enabled (i.e., non-alternating) guards. Our 
prototype relies on hybrid state machines (HSM) that com- 
bine an ASM with synchronous transition state machines 
(TSM) that are started on demand only. 

Fault tolerance: The presence of Byzantine faulty nodes 
forced us to abandon the classic "wait for all" paradigm 
traditionally used for enforcing the indication principle in 
asynchronous designs: Failures may easily inhibit the com- 
pletion of the request/acknowledge cycles typically used for 
transition-based flow control. Timing constraints, established 
by our theoretical analysis, in conjunction with state-based 
communication are resorted to in order to establish event 
ordering and synchronized executions in FATAL+. 

Self-Stabilization: In sharp contrast to non-stabilizing al- 
gorithms, which can always assume that there is a (sub- 
stantial) number of non-faulty nodes that run approximately 
synchronously and hence adhere to certain timing con- 
straints, self-stabilizing algorithms cannot even assume this. 
Although FATAL+ guarantees that non-faulty nodes will 
eventually execute synchronously, even when started from an 
arbitrary state, the violation of timing constraints and hence 
metastability [9] cannot be avoided during stabilization. For
example, state accept in Figure 2 has two successors, sleep
and recover, the guards of which could become true arbitrarily
close to each other in certain stabilizing scenarios. This
is acceptable, though, as long as such problematic events
are neither systematic nor frequent, which is ensured by the
design and implementation of FATAL+ (see Section VII-A).



Inspecting Figures 2 to 5 reveals that the state transitions
of the FATAL+ state machines are triggered by
AND/OR combinations of the following different types of
conditions:

(1) A watchdog timer expires ["($T_2$, accept)"].

(2) The state machines of a certain number (1, $\ge f + 1$,
or $\ge n - f$) of nodes reached a particular (subset of)
state(s) at least once since the reset of the corresponding
memory flags ["$\ge n - f$ accept"].

(3) The state machines of a certain number (1, $\ge f + 1$, or
$\ge n - f$) of nodes are currently in (one of) a particular
(subset of) state(s) ["in resync"].

(4) Always ["true"].

The above requirements reveal the need for the following 
major building blocks: 

• Concurrent HSMs, implementing the states and transi- 
tions specified in the protocol. 

• Communication infrastructure between those state ma- 
chines, continuously conveying the state information. 

• Watchdog timers (also with random timeouts) for im- 
plementing type (1) guards. 

• Threshold modules and memory flags for implementing 
type (2) and type (3) guards. 

Obviously, all these building blocks require implementations 
that match the assumptions of the formal model in Section II.
Apart from maintaining timing assumptions like an end- 
to-end communication delay bound t — T~^{t) < d, this 
also includes the need to implement all stateful components 
in a self-stabilizing way: They must be able to eventually 
recover from an arbitrary erroneous internal state, including 
metastability, when operating in the specified environment. 
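The guard types (1) to (3) listed above can be sketched in software as follows. This is only an illustrative model under simplifying assumptions (discrete data structures instead of asynchronous signals); the class `NodeView` and all of its field and method names are hypothetical and not part of the FATAL+ implementation.

```python
from dataclasses import dataclass, field


@dataclass
class NodeView:
    """Local knowledge of one node (illustrative names only)."""
    n: int  # total number of nodes
    f: int  # number of tolerated Byzantine faults
    expired: set = field(default_factory=set)    # expired (timer, state) pairs
    flags: dict = field(default_factory=dict)    # state -> senders seen since last flag reset
    current: dict = field(default_factory=dict)  # sender -> state currently on its bus

    def timer_guard(self, timer: str, state: str) -> bool:
        """Type (1): the watchdog timer associated with a state has expired."""
        return (timer, state) in self.expired

    def memory_guard(self, state: str, threshold: int) -> bool:
        """Type (2): enough nodes were observed in `state` since the flag reset."""
        return len(self.flags.get(state, set())) >= threshold

    def state_guard(self, states: set, threshold: int) -> bool:
        """Type (3): enough nodes are currently in one of `states`."""
        return sum(1 for s in self.current.values() if s in states) >= threshold


if __name__ == "__main__":
    view = NodeView(n=4, f=1)
    view.flags["accept"] = {0, 1, 2}
    view.current = {0: "resync", 1: "none", 2: "none", 3: "none"}
    print(view.memory_guard("accept", view.n - view.f))  # ">= n - f accept"
    print(view.state_guard({"resync"}, 1))               # "in resync"
```

Type (4) guards need no model, since they are constantly true.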

Before we proceed with a description of the implementa- 
tions of these components, we discuss how FATAL+ deals 
with the threat of metastability arising from our extreme 
fault scenarios. 

A. Metastability issues 

Reducing the potential for both metastability generation 
and metastability propagation are important goals in the 
design and implementation of FATAL+. Although it is 
impossible to completely rule out metastability generation in 
the presence of Byzantine faulty nodes (which may issue sig- 
nal transitions at arbitrary times) and during self-stabilization 
(where all nodes may be completely asynchronous), we 
nevertheless achieved the following properties: 
(I) Guaranteed metastability-freedom in fault-free execu- 
tions after stabilization. 
(II) Non-faulty nodes are safeguarded against "attacks" by 
faulty nodes that aim at inducing metastability, in 
particular once the system has stabilized. 

(III) Metastable upsets at non-faulty nodes are rare during 
stabilization, therefore delaying stabilization as little as 
possible. 

(IV) Very small windows of vulnerability and the possibility 
to incorporate additional measures for decreasing the 
upset probability further. 



The following approaches have been used in FATAL+ to 
accomplish these goals (additional details will be given in 
the subsequent sections): 

(I) is guaranteed by our proofs of metastability-freedom, 
which exploit the fact that all non-faulty nodes run ap- 
proximately synchronously after stabilization. It is hence 
relatively straightforward to ensure, via timing constraints, 
that some data from remote ASMs does not change while it 
is used. 

(II) is accomplished by several means, which make it 
very difficult (albeit not impossible) for a faulty node to 
generate/propagate metastability. Besides avoiding any ex- 
plicit control flow between ASMs by communicating states 
only, which greatly reduces the dependency of a non-faulty 
receiver node from a faulty sender, several forms of logical 
masking of metastability are employed. One example is the 
combination of memory flags and threshold gates, which en- 
sure that possibly upset memory flags are always overruled 
quickly by correct ones at the threshold output. A different 
form of logical masking occurs due to the fact that, after 
stabilization, all non-faulty nodes execute the outer cycle
of the main state machine (Figure 2) only: Since the outer
cycle does not involve any type (3) guard once stabilization
is achieved, any metastability originating from the (less
metastability-safe) resynchronization algorithm (Figure 4)
and its extension (Figure 3) is completely masked.

To accomplish (III), the measures outlined in (II) are
complemented by adding time masking using randomization:
The resynchronization routine (Figure 4) tries to initialize
recovery from arbitrary states at random, sufficiently sparse
points in time. It is hence very unlikely that non-faulty
nodes are kept from stabilizing due to metastable upsets.
Moreover, if at the beginning of the stabilization process
$f' < f < n/3$ nodes are faulty, up to $f - f'$ metastable
upsets can be tolerated without keeping the remaining nodes
from stabilizing; the nodes that became subject to newly
arising transient faults will stabilize quickly once $n - f$ nodes
established synchronization (cf. Theorem 4.17).

Finally, (IV) is achieved by implementing all building 
blocks that are susceptible to metastable upsets, like memory 
flags, in a way that minimizes the window of vulnerability. 
Moreover, elastic pipelines acting as metastability filters [29]
or synchronizers can be added easily to further protect such 
elements. 



B. State machine communication 

According to our system model, an HSM must be able to
continuously communicate its current state system-wide: It
is required that every receiver is informed of the sender's
current state within $d$ time (resp. within $d^+_{\min}$ and $d^+_{\max}$ for
the quick cycle algorithm). For simplicity, we use parallel
communication, by means of a suitably sized data bus, in
our implementation. Since a node treats itself like any
other node in type (2) and type (3) guards with thresholds, it
comprises a complete receiver as described below for every
node in the system (including itself).

Figure 6. Sender and single receiver (including memory flags) for the
ASM of the main algorithm.



Figure 6 shows the circuitry used for communicating the
current state of the main algorithm in Figure 2. The sender
consists of a simple array of flip-flops, which drive the
parallel data bus that thus continuously reflects the current
state of the sender's HSM. In sharp contrast to handshake-
based communication, reading at the receiver occurs without
any coupling to the sender here. As argued in Section VII-A,
the synchrony between non-faulty nodes guaranteed by the
FATAL+ protocol ensures that the sender state data will
always be stable when read after stabilization. For the
stabilization phase, we cannot give such a guarantee but take 
some (acceptable) risk of metastability. 

To avoid the unacceptable risk of reading and capturing 
false intermediate sender states due to different delays on 
the wires of the data bus, delay-insensitive [43] state coding
must be used. We have chosen the following encoding for
the main state machine in Figure 2:



It is, however, reasonably easy to replace parallel communication by 
serial communication, e.g., by extending the (synchronous) TSM appropri- 
ately. 



propose  0000    accept          1001
sleep    1011    sleep → waking  0011
waking   0101    ready           0110
recover  1100    join            1010
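The intended property of such a state coding can be checked mechanically: while the bits that differ between two consecutive codewords change in an arbitrary order, no intermediate bus value may coincide with the codeword of another state. The sketch below does this for the three transitions that are named explicitly in the text (propose to accept, accept to sleep, accept to recover). The assignment of the codes 1001, 0011, 0110, 1010 to accept, sleep → waking, ready, and join is an assumption read off the order of the table, the state label `sleep->waking` is our own spelling, and the full transition list of Figure 2 would have to be supplied for a complete check.

```python
from itertools import combinations

# State codes as read from the table (pairing of the second column assumed).
CODES = {
    "propose": 0b0000, "sleep": 0b1011, "waking": 0b0101, "recover": 0b1100,
    "accept": 0b1001, "sleep->waking": 0b0011, "ready": 0b0110, "join": 0b1010,
}

# Only the transitions mentioned explicitly in the text; not the full Figure 2.
TRANSITIONS = [("propose", "accept"), ("accept", "sleep"), ("accept", "recover")]


def intermediate_words(src: int, dst: int, width: int = 4) -> set:
    """All bus values visible while the differing bits flip one at a time."""
    diff = [b for b in range(width) if ((src ^ dst) >> b) & 1]
    words = set()
    for k in range(1, len(diff)):          # proper, non-empty subsets only
        for subset in combinations(diff, k):
            word = src
            for b in subset:
                word ^= 1 << b
            words.add(word)
    return words


def hazardous(codes: dict, transitions: list) -> list:
    """Transitions whose intermediate bus values include a valid codeword."""
    valid = set(codes.values())
    return [(a, b) for a, b in transitions
            if intermediate_words(codes[a], codes[b]) & valid]


if __name__ == "__main__":
    print(hazardous(CODES, TRANSITIONS))
    # A hypothetical direct jump that is not claimed to exist in Figure 2:
    print(hazardous(CODES, [("accept", "sleep->waking")]))
```

The second call illustrates that the property depends on the transition graph: a direct jump from accept (1001) to sleep → waking (0011) could pass through 1011, the code of sleep.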



For the other state machines making up FATAL+, it suffices
to communicate only a single bit of state information (supp
or none in Figure 3, init or wait in Figure 4, and propose⁺
or none⁺ in Figure 5). Hence, every bus consists of a single
wire here, and the decoder in the receiver becomes trivial.

The receiver consists of a simple combinational decoder
built from AND gates, which generate a 1-out-of-m
encoding of the binary representation of the state commu-
nicated via the data bus. The decoded signals correspond to
a single sender state each. This information is directly used
for type (3) guards, and fed into memory flags for type (2)
guards. Every memory flag is just an SR-latch with dominant
reset, whose functional equivalents are also included in
Figure 6. Note that a memory flag is set depending on the
state communicated by the sender, but (dominantly) cleared
under the receiver's control.
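The behavior of the decoder and of a memory flag can be summarized in a few lines. The following is a purely functional sketch (it ignores gate delays and metastability); `decode`, `MemoryFlag`, and the two-state example code table are illustrative names and values, not taken from the implementation.

```python
def decode(bus_value: int, codes: dict) -> dict:
    """1-out-of-m decoding: only the line of the matching state is high."""
    return {state: bus_value == code for state, code in codes.items()}


class MemoryFlag:
    """SR latch with dominant reset."""

    def __init__(self) -> None:
        self.q = False

    def update(self, set_in: bool, reset_in: bool) -> bool:
        if reset_in:        # reset wins, even if set is active as well
            self.q = False
        elif set_in:
            self.q = True
        return self.q       # with both inputs low, the flag keeps its value


if __name__ == "__main__":
    codes = {"propose": 0b0000, "accept": 0b1001}  # excerpt, for illustration
    flag = MemoryFlag()
    lines = decode(0b1001, codes)                  # sender shows accept
    flag.update(set_in=lines["accept"], reset_in=False)
    lines = decode(0b0000, codes)                  # sender moved on
    print(flag.update(set_in=lines["accept"], reset_in=False))  # flag remembers
    print(flag.update(set_in=True, reset_in=True))              # dominant reset
```

The decoded line feeds type (3) guards directly, whereas the flag output feeds type (2) guards.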

A memory flag may become metastable when the inputs 
change during stabilization of its feedback loop, which can 
occur due to (a) input glitches and/or (b) simultaneous falling 
transitions on both inputs. However, for correct receivers, (a) 
can only occur in case of a faulty sender, and (b) is again 
only possible during stabilization: Once non-faulty nodes
execute the outer cycle of Figure 2, it is guaranteed that, e.g.,
all non-faulty nodes enter accept before the first one leaves.
The probability of an upset is thus very small, and could 
be further reduced by means of an elastic pipeline acting 
as metastability filter (which must be accounted for in the 
delay bounds). 

The most straightforward implementation of the threshold
modules used for generating the $\ge f + 1$ and $\ge n - f$
thresholds in type (2) and type (3) guards is a simple sum-
of-product network, which just builds the OR of all AND
combinations of $f + 1$ resp. $n - f$ inputs. In our FPGA
implementation, a threshold module is built by means of 
lookup-tables (LUT); some dedicated experiments confirmed 
that they work glitch-free for monotonic inputs (as provided 
by the memory flags). 
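The sum-of-product construction can be written down directly and compared against a plain counting threshold. This is a functional sketch only (it says nothing about glitches or LUT mapping); the function name `threshold_sop` is illustrative.

```python
from itertools import combinations, product


def threshold_sop(inputs: tuple, k: int) -> bool:
    """OR over all AND terms built from k of the inputs."""
    return any(all(inputs[i] for i in combo)
               for combo in combinations(range(len(inputs)), k))


if __name__ == "__main__":
    n, f = 4, 1
    for k in (f + 1, n - f):
        # Exhaustively compare with "at least k inputs are high".
        ok = all(threshold_sop(bits, k) == (sum(bits) >= k)
                 for bits in product((False, True), repeat=n))
        print(f"threshold >= {k}: matches counting definition: {ok}")
```

Since the expression contains no negations, its output is monotonic in its inputs, which matches the monotonic inputs provided by the memory flags.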

C. Hybrid state machines 

Our prototype implementation of FATAL+ relies on hy- 
brid state machines (HSM): An ASM is used for determin- 
ing, by asynchronously evaluating the guards, the points 
in time when a state transition shall occur. Our ASMs 
have been built by deriving a state transition graph (STG)
specification directly from Figures 2 to 5 and generating the
delay-insensitive implementation via Petrify [32]. The actual

Note that the STG specification had to be extended slightly in order to
transform our possibly non-alternating guards (which might be continuously 
enabled in some cycle, in particular during stabilization) into strictly 
alternating ones. 




Figure 7. Example state transition, including the corresponding TSM.

state transition of an HSM is governed by an underlying
synchronous transition state machine (TSM). The TSM 
resolves a possibly non-deterministic choice of the successor 
state and then performs the required transition actions: 

1) Reset of memory flags and watchdog timers 

2) Communication of the new state 

3) Actual transition to the new state (i.e., enabling of 
further transitions of the ASM) 

The TSM is driven by a pausible clock (see Section VII-D),
which is started dynamically by the ASM before the tran-
sition. Note that this avoids the need for synchronization
with a free-running clock and hence preserves the ASM's
continuous time scale.



The TSM works as follows (see Figure 7): Assume that
the ASM is in state A, and that the guard G for the 
transition from A to B becomes true. In the absence of an 
inhibit signal (indicating that another transition is currently 
being taken, see below), the TSM clock is started. With 
every rising edge of TSMClock, the TSM unconditionally 
moves through a sequence of three states: synchronize (Syn), 
commit (Cmt), and terminate (Trm), shown in the rectangular
box in Figure 7. In Syn, the inhibit signal is activated to
prevent other choices from being executed in case of more 
than one guard becoming true. Whereas any ambiguity can 
easily be resolved via some priority rule, metastability due to 
(a) enabled guards that become immediately disabled again 
or (b) new guards that are enabled close to transition time 
cannot be ruled out in general here. However, as argued in
Section VII-A, (a) could only do harm to FATAL+ during
stabilization, due to type (3) guards; recall that type (1) and 
type (2) guards are always monotonic, with the reset (of 
watchdog timers and memory flags) being under the control 
of the local state machine. Similarly, our proofs reveal that 
upsets due to (b) are fully masked after stabilization. Thus, 
after stabilization, metastability of the TSM can only occur 
due to unstable inputs, i.e., upsets in memory flags. Given 
the small window of vulnerability of the synchronizing stage 
for Syn, the resulting very low probability of a metastable 
upset is considered acceptable. 

Once the TSM has reached Syn, it has decided to actually 
take the transition to B and hence moves on to state Cmt. 
Here the watchdog timer associated with B and possibly 




Figure 8. Pausible ring oscillator implementing the TSM clock. 



some memory flags are cleared according to the FATAL+ 
state machine, and the new state B is captured by the 
output flip-flops driving the state communication data bus 



(recall Section VII-B). Note that the resulting delay must
be accounted for in the communication delay bounds $d$,
$d^+_{\max}$ and $d^+_{\min}$. Finally, the TSM moves on to state Trm, in
which the reset signals are inactivated again and the TSM 
clock is halted. The inhibit signal is also cleared here, which 
effectively moves the ASM to state B. It is only now that 
guards pertaining to state B may become true. 
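The Syn, Cmt, Trm sequence described above can be mimicked by a small discrete model. The sketch below is an abstraction under stated assumptions: clock edges are explicit method calls, the priority rule is "first listed successor wins", and metastability is not modeled. The class `HybridStateMachine` and its attribute names are hypothetical.

```python
from enum import Enum


class Phase(Enum):
    TRM = 0  # terminated: TSM clock halted, ASM may evaluate guards
    SYN = 1  # synchronize: successor chosen, other choices inhibited
    CMT = 2  # commit: resets active, new state driven onto the bus


class HybridStateMachine:
    def __init__(self, state: str) -> None:
        self.state = state            # state of the ASM
        self.bus = state              # state shown on the communication bus
        self.phase = Phase.TRM
        self.clock_running = False
        self.inhibit = False
        self.reset_active = False     # reset of timers and memory flags
        self.candidates = []
        self.target = None

    def guards_enabled(self, targets: list) -> None:
        """ASM side: successors with true guards, in priority order."""
        if targets and not self.inhibit and not self.clock_running:
            self.candidates = list(targets)
            self.clock_running = True  # start the pausible TSM clock

    def rising_edge(self) -> None:
        """One rising edge of TSMClock."""
        if not self.clock_running:
            return
        if self.phase is Phase.TRM:
            self.phase = Phase.SYN
            self.inhibit = True
            self.target = self.candidates[0]   # simple priority rule (assumed)
        elif self.phase is Phase.SYN:
            self.phase = Phase.CMT
            self.reset_active = True
            self.bus = self.target             # communicate the new state
        else:
            self.phase = Phase.TRM
            self.reset_active = False
            self.inhibit = False
            self.state = self.target           # ASM effectively enters the new state
            self.clock_running = False         # clock is halted again


if __name__ == "__main__":
    hsm = HybridStateMachine("A")
    hsm.guards_enabled(["B", "C"])
    for _ in range(3):
        hsm.rising_edge()
        print(hsm.phase.name, hsm.state, hsm.bus)
```

Note that the bus already shows the new state in Cmt, one clock edge before the ASM itself enters it, which is why the extra delay has to be included in the communication delay bounds.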

D. Pausible oscillator 

The TSM clock is an asynchronously startable and syn- 
chronously stoppable ring oscillator, which provides a clock 
signal TSMClock that is LOW when the clock is stopped 
via an input signal TSMCStop. Note that copies of this 
oscillator are used for driving the watchdog timers presented
in Section VII-E.



The operation of the TSM clock circuit shown in Figure 8
is straightforward: In its initial state, TSMCStop=HIGH
and the Muller C-gate has HIGH at its output, such that
TSMClock=LOW. Note that the circuit also stabilizes to
this state if the Muller C-gate was erroneously initialized
to LOW, as the ring oscillator would eventually generate
TSMClock=HIGH, enforcing the correct initial value HIGH
of the C-gate.

When the ASM requests a state transition, at some arbi- 
trary time when a transition guard became true, it just sets 
TSMCStop=LOW. This starts the TSM clock and produces 
the first rising edge of TSMClock half a clock cycle time 
later. As long as TSMCStop remains LOW, the ring oscillator 
runs freely. 

The frequency of the ring oscillator is primarily deter-
mined by the (odd) number of inverters in the feedback
loop. It varies heavily with the operating conditions, in
particular with supply voltage and temperature: The result-
ing (two-sided) clock drift $\zeta$ is typically in the range of
7% to 9% for uncompensated ring oscillators like ours; in
ASICs, it could be lowered to 1% to 2% by special
compensation techniques [44]. Note that the two-sided clock
drifts map to $\vartheta = (1 + \zeta)/(1 - \zeta)$ bounds of 1.15 to 1.19
and 1.02 to 1.04, respectively.
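The conversion from a two-sided drift to the one-sided bound can be reproduced numerically. The short sketch below assumes that a two-sided drift $\zeta$ means clock rates within $[1 - \zeta, 1 + \zeta]$ of nominal, so that the ratio between fastest and slowest clock is $(1 + \zeta)/(1 - \zeta)$; the function name is illustrative.

```python
def theta_from_two_sided_drift(zeta: float) -> float:
    """Ratio of fastest to slowest clock rate for rates in [1 - zeta, 1 + zeta]."""
    if not 0.0 <= zeta < 1.0:
        raise ValueError("zeta must lie in [0, 1)")
    return (1.0 + zeta) / (1.0 - zeta)


if __name__ == "__main__":
    for zeta in (0.07, 0.09, 0.01, 0.02):
        print(f"zeta = {zeta:.2f} -> theta = {theta_from_two_sided_drift(zeta):.4f}")
```

For the quoted drift values this evaluates to roughly 1.151, 1.198, 1.020, and 1.041.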

The stopping of TSMClock is regularly initiated by the 
TSM itself: With the rising edge of TSMClock that moves the 
TSM into Trm, TSMCStop is set to HIGH. Since TSMClock 

In our FPGA implementation, the oscillator frequency is so high that
we also employ a frequency divider at the output.



is also HIGH after the rising edge, the C-gate output is
also forced to HIGH. Hence, after having finished the half 
period of this final clock cycle, the feedback loop is frozen 
and TSMClock remains LOW. 



For metastability-free operation of the C-gate in Figure 8,
(a) the falling transition of TSMCStop must not occur simul-
taneously with a rising edge of TSMClock, and (b) the rising 
transition of TSMCStop must not occur simultaneously with 
the falling edge of TSMClock. (a) is guaranteed by stopping 
the clock in state Trm of the TSM, since the output of the 
C-gate is permanently forced to HIGH on this occasion; 
TSMClock cannot hence generate a rising transition before 
TSMCStop goes to LOW again. Whereas this synchronous 
stopping normally also ensures (b), we cannot always rule 
out the possibility of getting TSMCStop=HIGH close to the
first rising edge of TSMClock: (b) could thus occur due to 
prematurely disabled type (3) guards, which we discussed 
already with respect to their potential to create metastability 



in the TSM, recall Section VII-C. Besides being a rare event,
this can only do harm during stabilization, however.
E. Watchdog Timers 



Recall that every ASM state, except for accept in Figure 2,
is associated with at most one watchdog timer required for
type (1) guards; accept is associated with three timers (for
$T_1$ and $T_2$ as well as for a timeout in Figure 5). A timer is reset
by the TSM when its associated state is entered.



According to Figure 9, every watchdog timer consists of
a synchronous resettable up-counter that is clocked by some
oscillator, and a timeout register that holds the timeout value.
A comparator raises an output signal if the counter value
is greater than or equal to the register value. An SR latch with
dominant reset memorizes this expired condition until the 
timer is re-triggered. 
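The timer structure just described (up-counter, timeout register, comparator, and a latch that remembers expiration) can be modeled as follows. This is a behavioral sketch with explicit clock edges; the class and method names are hypothetical, and stopping the oscillator upon expiration is modeled simply by ignoring further edges.

```python
class WatchdogTimer:
    """Up-counter with comparator and an 'expired' latch with dominant reset."""

    def __init__(self, timeout: int) -> None:
        self.timeout = timeout   # contents of the timeout register
        self.counter = 0
        self.expired = False     # output of the SR latch
        self.running = False     # whether the timer's oscillator is running

    def reset(self) -> None:
        """Re-trigger: clear counter and latch (reset dominates)."""
        self.counter = 0
        self.expired = False
        self.running = False

    def start(self) -> None:
        """Start the oscillator after the reset has been released."""
        self.running = True

    def clock_edge(self) -> None:
        if not self.running:
            return
        self.counter += 1
        if self.counter >= self.timeout:   # comparator
            self.expired = True            # latch memorizes the expiration
            self.running = False           # oscillator stops upon expiration


if __name__ == "__main__":
    timer = WatchdogTimer(timeout=3)
    timer.reset()
    timer.start()
    for edge in range(1, 5):
        timer.clock_edge()
        print(edge, timer.counter, timer.expired)
```

Separating `reset` from `start` mirrors the ordering used in the implementation, where the reset and the oscillator start happen in different TSM states.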
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Figure 9. Implementation principle of watchdog timers. 

Like the TSM, timers are driven by pausible oscillators, 
which are started by the TSM after resetting the timer 
and stopped synchronously upon timer expiration. Note that 
every timer (except for the multiple accept timers, which 
share a common oscillator that is stopped when the largest 
timeout expires) is provided with a dedicated oscillator in 
our implementation for simplicity. This not only avoids 

Obviously, we only have to take care in the timing analysis that setting
TSMCStop=HIGH occurs well within the first half period.



quantization errors in the continuous timing of the ASM 
state transitions, but is also mandatory for avoiding the po- 
tential of metastability due to timer resets colliding with the 
transitions of a free-running clock. In our implementation, 
the timer reset takes place in TSM state Cmt, while the 
oscillator is started in state Trm. This well-ordered sequence
rules out all metastability issues. 

As for the watchdog timer with random timeout $R_3$ in Figure 4,
our implementation uses a linear feedback
shift register (LFSR) clocked by a dedicated oscillator:
A uniformly distributed random value, sampled from the 
LFSR, is loaded into the timeout register whenever the 
watchdog timer is re-triggered. Note that for many settings, it 
is reasonable to assume that the new random value remains 
a secret until the timeout expires, as it is not read or in 
any other way considered by the node until then. As our 
prototype implementation is not meant for studying security 
issues, this simple implementation is thus sufficient. 
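A software analogue of such an LFSR-based timeout source is sketched below. The register width (16 bits), the tap positions (16, 14, 13, 11, a well-known maximal-length choice), the seed, and the mapping to a timeout range are all assumptions made for illustration; the paper does not specify them.

```python
class Lfsr16:
    """16-bit Fibonacci LFSR with taps 16, 14, 13, 11 (period 65535)."""

    def __init__(self, seed: int = 0xACE1) -> None:
        if seed & 0xFFFF == 0:
            raise ValueError("the all-zero state is a fixed point")
        self.state = seed & 0xFFFF

    def step(self) -> int:
        s = self.state
        bit = (s ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1   # feedback bit
        self.state = (s >> 1) | (bit << 15)
        return self.state


def sample_timeout(lfsr: Lfsr16, max_timeout: int) -> int:
    """Value loaded into the timeout register on re-trigger (range 1..max_timeout).

    The modulo reduction is only approximately uniform unless max_timeout
    divides the number of LFSR states evenly.
    """
    return 1 + lfsr.step() % max_timeout


if __name__ == "__main__":
    lfsr = Lfsr16()
    print([sample_timeout(lfsr, 1000) for _ in range(5)])
```

In hardware the register runs freely on its own oscillator and is merely sampled on re-trigger, whereas this sketch advances it on demand.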

VIII. Experimental Evaluation 

Our prototype implementation has been written in VHDL 
and compiled for an Altera Cyclone IV FPGA using the 
Quartus tool. 

Apart from standard functional and timing verification 
via Modelsim, we conducted some preliminary experiments 
for verifying the assumed properties (glitch-freeness,
monotonicity, etc.) of the synthesized implementations of our
core building blocks: Since FPGAs do not natively provide 
the basic elements required for asynchronous designs, and 
we have no control over the actual mapping of functions 
to the available LUTs (e.g. our threshold modules are 
implemented via LUT instead of the intended combinational 
AND-OR networks), we had to make sure that properties 
that hold naturally in "real" asynchronous implementations 
also hold here. Backed up by the (positive) results of these 
experiments, a complete system consisting of $n = 4$ resp.
$n = 8$ nodes (tolerating at most $f = 1$ resp. $f = 2$ Byzantine
faulty nodes) has been built and verified to work as expected;
overall, they consume 23000 resp. 55000 logic blocks. Note,
however, that both designs include the test environment,
which makes up a significant part of the designs.

To facilitate systematic experiments, we also developed a 
custom test bench that provides the following functionality: 

(1) Measurement of pulse frequency and skew at different 
nodes. 

(2) Continuous monitoring of the potential of non-deter- 
ministic HSM state transitions. 

(3) Starting the entire system from an arbitrary state 
(including memory flags and timers), both determin- 
istically and randomly chosen. 

(4) Resetting a single node to some initial state, at arbi- 
trary times. 

(5) Varying the clock frequency of any oscillator, at arbi- 
trary times. 



(6) Varying the communication delay between any pairs 
of sender and receiver, at arbitrary times. 
All these experiments can be done with and without up to $f$
(actually, $f + 1$ to also include excessively many) Byzantine
nodes. To this end, the HSMs of at most $f + 1 = 3$ nodes
can be replaced by special devices that make it possible to (possibly
inconsistently) communicate, via the communication data
buses, any HSM state to any receiver HSM at any time.

(1) is accomplished using standard measurement equip- 
ment (logic analyzer, oscilloscope, frequency counter) at- 
tached to the appropriate signals routed via output pins. (2) 
is implemented by memorizing any event where more than 
one guard is enabled when the TSM performs its first state 
transition, in a flag that can be externally monitored. 

(3) is realized by adding a scan-chain to the imple-
mentation, which allows arbitrary initial system states to be
shifted in serially at run-time. Repeated random experiments are
controlled via a Python script executed at a PC workstation, 
which is connected via USB to an ATMega 16 micro- 
controller (uC) that acts as a scan-controller towards the 
FPGA: The uC takes a bit-stream representing an initial 
configuration, sends it to the FPGA via the serial scan- 
chain interface, and signals the FPGA to start execution of 
FATAL+. When the system has stabilized, the uC informs 
the Python script which records the stabilization time and 
proceeds with sending the next initial configuration. 

To enable (4)-(6), the testbench provides a global high- 
resolution clock that can be used for triggering mode 
changes. To ensure its synchrony w.r.t. the various node
clocks, all start/stoppable ring oscillators are replaced by 
start/stoppable oscillators that derive their output from the 
global clock signal. (4) is achieved by just forcing a node 
to reset to its initial state for this run at any time during 
the current execution. In order to facilitate (5), dividers 
combined with clock multipliers (PLLs) are used: For any 
oscillator, it is possible to choose one of five different 
frequencies (0, excessively slow, slow, fast, excessively fast) 
at any time. For (6), a variable delay line implemented as a 
synchronous shift register of length $X \in [0, 15]$, driven by
the global clock, can be inserted in any data bus connecting 
different HSMs individually. 

In order to exercise also complex test scenarios in a re- 
producible way, a dedicated testbed execution state machine 
(TESM), driven by the global clock, is used to control the 
times and nodes when and where clock speeds, transmission 
delays and communicated fault states are changed and when 
a single node is reset throughout an execution of the system. 
Transition guards may involve global time and any combi- 
natorial expression of signals used in the implementation of 
FATAL+, i.e., any predicate on the current system state.

To decrease the experiment setup time (after all, changing the TESM
requires recompilation of the entire system), the TESM is gradually changed 
to also incorporate additional parameters and configuration information 
downloaded at run-time via the uC. 




Figure 10. FATAL and FATAL+ clocks: MainAlgState [ i ] =1 iff i is 
in accept, and FATAL+CLK [i] is i's FATAL+ signal. 




Figure 11. Worst-case skew scenario for FATAL clocks.

Figure 12. Head of distribution of stabilization times (in s) for over
6500 randomly initialized 8-node instances.

Figure 13. Tail of distribution of stabilization times (in s) for over
6500 randomly initialized 8-node instances.



Using our testbench, it was not too difficult to get our
FATAL+ implementation up and running. As expected, we spotted several
hidden design errors that showed up during our experiments,
but also some errors (like a missing factor of $d$ in one
of our timeouts due to a typo) in the initial version of
our theoretical analysis, which caused deviations of the
measured w.r.t. the predicted performance.

Finally, using the implementation parameters $\vartheta = 1.3$,
d = ISr, djl^jjj — d^iax — 32^, where T is the experimental 
clock period T = 400ns, and minimal timeouts according 
to the constraints, we conducted the following experiments, 
observing the behavior of both, the FATAL+ as well as the 
underlying FATAL system: 

(A) Maximum skew scenarios, including effects of exces-
sively slow/fast clocks and message delays: The experimen-
tal results confirmed the analytic predictions as being tight:
As shown in Figure 10, pulses of the 8-node FATAL resp.
FATAL+ system occur at a frequency of about 62 Hz resp.
10 kHz. Note that the quite low values for the frequency stem
from the fact that we were intentionally slowing down the
system in order to carry out our worst-case experiments.

The figure further clearly demonstrates the capability of
FATAL+ to generate pulses with significantly less skew
(1 µs) on top of the FATAL pulses.

Further experiments, involving $f = 2$ Byzantine nodes,
were used to produce a worst-case scenario for the FATAL
skew (6 µs). The resulting waveform is depicted in Figure 11.

(B) Scenarios leading to the potential of non-deterministic 



HSM state transitions in the absence of Byzantine nodes 
(which would invalidate our proof of metastability-freedom 
if happening after stabilization): We ran 17000 experiments,
in each of which the 8 node system was set up with randomly 
chosen message delays between nodes and random clock 
speeds and stabilized from random initial states. Within 
10 seconds from stabilization on, not a single upset was 
encountered in any instance. 

(C) Stabilization of an 8-node system from random initial 
states, with randomly chosen clock speeds and message 
delays (without Byzantine nodes). Over 4000 runs have been 
performed. A considerable fraction of the setups stabilizes 
within less than 0.035 seconds, which can be credited to the 
fast stabilization mechanism intended for individual nodes
resynchronizing to a running system (see Figure 12 and
Figure 13). The remaining runs stabilize, supported by the
resynchronization routine, in less than 10 seconds, which is
less than the system's upper bound on $T(1)$. Note that the
stabilization time is inversely proportional to the frequency, 
i.e., in a system that is not slowed down stabilization is 
orders of magnitude faster. 

IX. Conclusions 

We conclude with a few considerations regarding the
asymptotic complexity of implementations of FATAL+ and
future work. The algorithm has the favorable property that
nodes broadcast a constant number of bits in constant time,
which clearly is optimal. While it would be beneficial to
reduce node degrees, this must come at the price of reducing
the resilience to faults [19], [20]. In terms of the number of
Byzantine faults the algorithm can sustain in relation to node
degrees, the algorithm is asymptotically optimal as well.
It is subject to future work to extend the algorithm to be 
applicable to networks of lower degree in a way preserving 
resilience to a (local) number of faults that is optimal in 
terms of connectivity. 

Furthermore, it is not difficult to see that except for
the threshold modules, each node comprises a number of
basic components that is linear in $n$ (cf. [11], where similar
building blocks were used). In an ASIC implementation,
one could implement the threshold modules by sorting
networks, resulting in a latency of $O(\log n)$ and a gate
complexity of $O(n \log n)$ [45]. Clearly, it is necessary to
have conditions involving more than $f$ nodes in order to
overcome $f$ Byzantine faults. Hence, assuming constant fan-
in of the gates, both the current and envisioned solutions are
asymptotically optimal with respect to latency. Optimality of
an implementation relying on sorting networks with respect
to gate complexity is not immediate; however, there is at
most a logarithmic gap to the trivial lower bound of $\Omega(n)$.
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